Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras

Abstract.

Multi-Task Learning (MTL) is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to Single-Task Learning (STL), MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL’s key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including computer vision, natural language processing, recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how MTL has evolved from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and task-agnostic training, along with the capacity for zero-shot learning, which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research.
This project is publicly available at https://github.com/junfish/Awesome-Multitask-Learning.

Jun Yu (1, 2, †, ‡), Yutong Dai (3), Xiaokang Liu (2, 4), Jin Huang (5), Yishan Shen (2), Ke Zhang (6),
Rong Zhou (1), Eashan Adhikarla (1), Wenxuan Ye (1), Yixin Liu (1), Zhaoming Kong (7), Kai Zhang (1),
Yilong Yin (5), Vinod Namboodiri (1, 8), Brian D. Davison (1), Jason H. Moore (9), Yong Chen (2, ‡)
(1) Department of Computer Science and Engineering, Lehigh University, USA
(2) Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, USA
(3) Department of Industrial and Systems Engineering, Lehigh University, USA
(4) Department of Statistics, University of Missouri, USA
(5) School of Software, Shandong University, China
(6) Department of Computer Science, University of Hong Kong, China
(7) Department of Computer Science and Engineering, South China University of Technology, China
(8) Department of Community and Population Health, Lehigh University, USA
(9) Department of Computational Biomedicine, Cedars-Sinai Medical Center, USA

† This work includes efforts as a visiting student at UPenn.

‡ Correspondence to [email protected] or [email protected].

Figure 1. Significant landmarks in the evolution of Multi-Task Learning (MTL) highlighted over time.

Keywords: Deep Learning, Generative Pretrained Transformers, Multi-Objective Optimization, Multi-Task Learning, Pretrained Foundation Models, Prompt Learning

Figure 2. The structure of this survey.

1. Introduction

In the introduction, we hope to answer the following five research questions (RQs) before we overview the methodologies of Multi-task Learning (MTL):

  • RQ1: What is the concept and definition of MTL? (See § 1.1)

  • RQ2: How does MTL distinguish itself from other learning paradigms? (See § 1.2)

  • RQ3: What motivates the use of MTL in learning scenarios? (See § 1.3)

  • RQ4: What underlying principles does the efficacy of MTL rest on? (See § 1.4)

  • RQ5: In what ways does our survey differentiate from previous studies? (See § 1.5)

In § 1.1, we progressively introduce Multi-Task Learning (MTL), starting with a broad sense and culminating in a formal definition. Subsequently, § 1.2 explores the position of MTL within the Machine Learning (ML) landscape, drawing comparisons with related paradigms such as Transfer Learning (TL), Few-Shot Learning (FSL), lifelong learning, Multi-View Learning (MVL), to name a few. § 1.3 delves into the motivations for employing MTL, offering insights from both explicit and subtle angles, while also addressing how MTL benefits the involved tasks. In § 1.4, we delve deeper into the fundamental mechanisms and theories underpinning MTL, specifically: 1) regularization, 2) inductive bias, and 3) feature sharing, providing an understanding of its underlying principles. Finally, § 1.5 reviews existing surveys on MTL, underscoring the unique contributions of our survey and laying out a structured roadmap for the remainder of this work. The structure of our survey is depicted in Fig. 2. Before delving into this survey, readers can quickly refer to Table 1 for a list of acronyms not related to datasets, institutions, and newly proposed methods, while an overview of mathematical notations is provided in Table 3 and Table 6.

Table 1. Alphabetically sorted index table of acronyms.
Abbreviation | Expanded Form | Abbreviation | Expanded Form
AD | Alzheimer’s Disease | AGM | Accelerated Gradient Method
APM | Accelerated Proximal Method | CE | Cross-Entropy
CNN | Convolutional Neural Network | CT | Computed Tomography
CV | Computer Vision | DA | Domain Adaptation
DL | Deep Learning | DNN | Deep Neural Network
FCN | Fully Convolutional Network | FNN | Feedforward Neural Network
FSL | Few-Shot Learning | GAN | Generative Adversarial Network
GCN | Graph Convolutional Network | GNN | Graph Neural Network
GP | Gaussian Process | GPT | Generative Pretrained Transformer
GPU | Graphics Processing Unit | GRL | Gradient Reversal Layer
I/O | Input/Output | KD | Knowledge Distillation
LLM | Large Language Model | LSTM | Long Short-Term Memory
MAP | Maximum A Posteriori | MCI | Mild Cognitive Impairment
MDP | Markov Decision Process | MIM | Masked Image Modeling
MIML | Multi-Instance Multi-Label learning | MIMO | Multi-Input Multi-Output
MISO | Multi-Input Single-Output | ML | Machine Learning
MLM | Masked Language Modeling | MLP | Multi-Layer Perceptron
MoE | Mixture-of-Experts | MOO | Multi-Objective Optimization
MRI | Magnetic Resonance Imaging | MSE | Mean Squared Error
MTL | Multi-Task Learning | MTRL | Multi-Task Reinforcement Learning
MVL | Multi-View Learning | NAS | Neural Architecture Search
NLI | Natural Language Inference | NLP | Natural Language Processing
OCR | Optical Character Recognition | OOD | Out-Of-Distribution
PET | Positron Emission Tomography | PFM | Pretrained Foundation Model
PSD | Positive Semi-Definite | RL | Reinforcement Learning
RNN | Recurrent Neural Network | seq2seq | sequence to sequence
SIMO | Single-Input Multi-Output | SNP | Single Nucleotide Polymorphism
SGD | Stochastic Gradient Descent | SSL | Self-Supervised Learning
SOTA | State-Of-The-Art | STL | Single-Task Learning
SVD | Singular Value Decomposition | SVM | Support Vector Machine
TL | Transfer Learning | TPU | Tensor Processing Unit
VLM | Vision-Language Model | VQA | Visual Question Answering
ZSL | Zero-Shot Learning | |

This table excludes abbreviations pertaining to datasets, institutions, and newly proposed methods.

1.1. Definition

Figure 3. The total number of published papers ($y$-axis) on the MTL topic has surged from 1997 to 2023 ($x$-axis).

The increasing popularity of MTL over the past few decades is evident in Fig. 3, which displays the trend in the number of papers associated with “allintitle: ‘multitask learning’ OR ‘multi-task learning’ ” as a keyword search, according to data from Google Scholar (https://scholar.google.com).

As the name suggests, MTL is a subfield of ML in which multiple tasks are learned jointly. The aim is to leverage useful information across related tasks and to break from the tradition of performing different tasks in isolation. In Single-Task Learning (STL), data specific to the task at hand is the only source for training a learner, whereas MTL can conveniently transfer extra knowledge learned from other tasks. The essence of MTL is to exploit consensual and complementary information among tasks by combining data resources and sharing knowledge. This points to a better learning paradigm that can reduce memory burden and data consumption while improving training speed and testing performance. For instance, simultaneously learning monocular depth estimation (estimating the distance to the camera) (eigen2014depth) and semantic segmentation (assigning a class label to every pixel) (fu1981survey) in images is beneficial, since both tasks need to perceive meaningful objects. MTL has become increasingly ubiquitous as experimental and theoretical analyses continue to validate its promising results. For example, using Face ID to unlock an iPhone is a typical yet barely noticeable MTL application that involves simultaneously locating the user’s face and identifying the user. In general, multitasking occurs whenever we attempt to handle two or more objectives during the optimization stage.

Consequently, MTL exists everywhere in ML, even when performing STL with regularization. This can be understood as having one target task and an additional artificial task that encodes a human preference, such as learning a constrained model via an $\ell_2$ regularizer or a parsimonious model via an $\ell_1$ regularizer. These hypothesis preferences can serve as an inductive bias to enhance an inductive learner (caruna1993multitask). In the early exploration of MTL (caruana1997multitask), the extra information that the involved tasks provide is regarded as a domain-specific inductive bias for the other tasks. Since collecting training signals from other tasks is more practical than acquiring inductive bias from model design or human expertise, we can empower any ML model via this MTL paradigm.
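To illustrate the view of regularized STL as a two-objective problem, the following minimal Python sketch adds a preference term to a data-fitting loss. The linear model, the toy data, the weight `lam`, and all function names are illustrative assumptions rather than part of any method surveyed here.

```python
def data_loss(w, xs, ys):
    """Target task: mean squared error of a linear model y ~ w . x."""
    preds = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def l2_penalty(w):
    """Artificial task: prefer small (constrained) weights."""
    return sum(wi * wi for wi in w)

def l1_penalty(w):
    """Artificial task: prefer sparse (parsimonious) weights."""
    return sum(abs(wi) for wi in w)

def regularized_objective(w, xs, ys, lam, penalty):
    """Two objectives combined into one scalar: fit the data, satisfy the preference."""
    return data_loss(w, xs, ys) + lam * penalty(w)

# Toy data (hypothetical): w = [1, 2] fits both samples exactly.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1.0, 2.0]
w = [1.0, 2.0]

fit_only = regularized_objective(w, xs, ys, 0.0, l2_penalty)  # data term only
ridge = regularized_objective(w, xs, ys, 0.1, l2_penalty)     # 0.0 + 0.1 * 5.0
lasso = regularized_objective(w, xs, ys, 0.1, l1_penalty)     # 0.0 + 0.1 * 3.0
```

Even though the data term is zero at this `w`, the penalty terms are not, so the two objectives pull the optimum in different directions; this is the sense in which a regularizer behaves like an extra task.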

1.1.1. Formal Definition

To comprehensively understand MTL, we provide a formal definition. Suppose we have a sample dataset $\boldsymbol{X}$ drawn from the feature space $\mathcal{X}$, and its respective ground-truth label set $\boldsymbol{Y}$ drawn from the label space $\mathcal{Y}$. We can define experience $\mathcal{E} \subseteq \{\boldsymbol{X}, \boldsymbol{Y}\}$, domain $\mathcal{D} = (\mathcal{X}, P(\boldsymbol{X}))$, and task $\mathcal{T} = (\mathcal{Y}, f)$, where $P(\boldsymbol{X})$ is the distribution of $\boldsymbol{X}$ and $f$ maps a data sample $\boldsymbol{x} \in \boldsymbol{X}$ to a prediction $\tilde{\boldsymbol{y}} \in \boldsymbol{Y}$. These predictive values constitute the predictive label set $\tilde{\boldsymbol{Y}} = \{\tilde{\boldsymbol{y}} \mid \tilde{\boldsymbol{y}} = f(\boldsymbol{x}), \boldsymbol{x} \in \boldsymbol{X}\}$. Following the ML settings, we should define a measurement $\mathcal{P} = (\boldsymbol{Y}, \tilde{\boldsymbol{Y}}, \mathcal{L})$, where $\mathcal{L}$ is a function that measures the distance between any pair $(\boldsymbol{y}, \tilde{\boldsymbol{y}})$. For more basic notation, please refer to Table 3.
Based on the definitions of the four basic elements above (experience, domain, task, and measurement), we first restate the general definition of machine learning by mitchell1997machine in a more exact form as follows.

Definition 1 (Machine Learning, mitchell1997machine).

A computer program is said to learn from experience $\mathcal{E}$ with respect to a set of tasks $\{\mathcal{T}^{(t)}\}_{t=1}^{T}$ and performance measurement $\mathcal{P}$, if its performance at tasks $\{\mathcal{T}^{(t)}\}_{t=1}^{T}$, as measured by $\mathcal{P}$, improves with experience $\mathcal{E}$.

The definition above inherently covers both single-task and multi-task scenarios in the ML process, but it falls short of a precise characterization of MTL that reflects recent developments. We therefore first define STL and then use it to derive the formal definition of MTL.

Definition 2 (Single-Task Learning).

A type of machine learning specified by $\mathcal{E}$, $\{\mathcal{T}^{(t)}\}_{t=1}^{T}$ and $\mathcal{P}$, where $\{\mathcal{T}^{(t)}\}_{t=1}^{T}$ contains only one task (i.e., $T = 1$) on a specific domain $\mathcal{D}$.

As recent developments in MTL focus more on heterogeneous tasks (e.g., regression + classification) than homogeneous ones, each task should be represented by its own experience $\mathcal{E}$ on its corresponding domain $\mathcal{D}$. Due to this diversity, we employ a distinct measurement $\mathcal{P}$ to evaluate the learning performance of each task. We accordingly define MTL as follows.

Definition 3 (Multi-Task Learning).

A superset of STL specified by $\bigcup_{t=1}^{T} \mathcal{E}^{(t)}$, $\{\mathcal{T}^{(t)}\}_{t=1}^{T}$ and $\{\mathcal{P}^{(t)}\}_{t=1}^{T}$, where experience $\mathcal{E}^{(t)} \subseteq \{\boldsymbol{X}^{(t)}, \boldsymbol{Y}^{(t)}\}$ is with respect to task $\mathcal{T}^{(t)}$ on its corresponding domain $\mathcal{D}^{(t)}$.
Accordingly, MTL is a computer program that learns from the experience set $\bigcup_{t=1}^{T} \mathcal{E}^{(t)}$ with respect to the task set $\{\mathcal{T}^{(t)}\}_{t=1}^{T}$ and the corresponding performance measurement set $\{\mathcal{P}^{(t)}\}_{t=1}^{T}$, if its total performance at any task $\mathcal{T}^{(t)}$, as measured by its corresponding $\mathcal{P}^{(t)}$, $t = 1, \cdots, T$, improves with experience set $\bigcup_{t=1}^{T} \mathcal{E}^{(t)}$.

We note that the formal MTL definition above has no conflict with the homogeneous or heterogeneous MTL.
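As a reading aid for Definition 3, the sketch below encodes a task together with its own experience and its own measurement, and evaluates a heterogeneous pair of tasks (regression + classification). The class name `Task`, the scalar inputs, the toy predictors, and the choice of losses are hypothetical and serve only to mirror the notation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Task:
    """One task T^(t) with its own experience E^(t) and measurement P^(t)."""
    name: str
    experience: List[Tuple[float, float]]   # pairs (x, y) from X^(t), Y^(t)
    predictor: Callable[[float], float]     # f^(t): x -> predicted y
    loss: Callable[[float, float], float]   # L^(t)(y, predicted y)

    def measure(self) -> float:
        """Average loss of f^(t) over E^(t); lower is better."""
        total = sum(self.loss(y, self.predictor(x)) for x, y in self.experience)
        return total / len(self.experience)

def squared_error(y: float, y_hat: float) -> float:
    return (y - y_hat) ** 2

def zero_one(y: float, y_hat: float) -> float:
    return 0.0 if y == y_hat else 1.0

# Heterogeneous tasks: each has its own data and its own measurement.
regression = Task("regression", [(1.0, 2.0), (2.0, 4.0)],
                  lambda x: 2.0 * x, squared_error)
classification = Task("classification", [(-1.0, 0.0), (3.0, 1.0), (0.5, 0.0)],
                      lambda x: float(x > 0), zero_one)

mtl_problem = [regression, classification]       # T = 2; T = 1 would be STL
is_single_task = len(mtl_problem) == 1
per_task_scores = {t.name: t.measure() for t in mtl_problem}
```

Keeping one score per task, rather than a single pooled number, reflects the requirement that every task be judged by its own measurement.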

1.2. Related Fields

Having established a formal definition of MTL grounded in fundamental ML elements, a thorough understanding can be achieved by analytically comparing it with related domains. These include Transfer Learning (TL), Meta-Learning, and In-Context Learning (ICL), among others. This comparison not only clarifies the distinct characteristics of MTL but also situates it within the broader context of these interconnected fields.

Transfer Learning (TL)

TL (pan2009survey) is a prevalent learning paradigm that addresses the lack of labeled data when applying ML to real-world data (zhuang2020comprehensive; pan2009survey). Specifically, TL improves the performance of a target model on target domains by transferring the knowledge in different but related source domains to the target domains. Such properties make TL well-appreciated in real-world applications, such as healthcare (kao2021toward; song2021transfer; perez2021transfer) and recommender systems (tl_recom_www21; liu2021leveraging; tl_recom_cikm21). According to the availability of labels in the source and target domains, TL is categorized into three types, i.e., transductive TL (aka Domain Adaptation (DA)) (redko2019advances; patel2015visual), inductive TL, and unsupervised TL (zhuang2020comprehensive; pan2009survey).

Few-Shot Learning (FSL)

FSL (fink2004object; fei2006one; wang2020generalizing) is a specific application case of TL. It aims at obtaining a model for the target task under a certain scenario where limited labeled samples from the target domain are available (wang2020generalizing). FSL is well-acknowledged in tackling different real-world problems such as identifying atypical ailments (quellec2020automatic; jia2020few), visual navigation (al2022zero; luo2021few), and cold-start item recommendation (sun2021mfnp; zhang2021model).

Meta-Learning

Meta-Learning (hospedales2021meta) is an implementation approach to achieve TL. The main concept is to obtain a meta-learner (a model) that performs satisfactorily on an unseen target domain (hospedales2021meta). Such a meta-learner first extracts the meta-knowledge, i.e., the universally applicable principles, across source domains. With this meta-knowledge, the meta-learner can be easily generalized to the target domain by leveraging the target samples. Meta-learning has been successfully applied to various problems such as hyper-parameter optimization (bohdal2021evograd; raghu2021meta), algorithm selection for data mining (simchowitz2021bayesian), and neural architecture search (NAS) (lee2021hardware; ding2022learning).

Though TL paradigms, including FSL and meta-learning, involve multi-domain data, their ultimate goal is to obtain a model that performs satisfactorily on, or can be easily generalized to, one target task. In other words, TL leverages the knowledge in different tasks to assist the model in learning a single task, which intersects with MTL as defined in Definition 3. Thus, TL can bring merits to MTL, such as capturing the relations among tasks and extracting shared knowledge among the involved tasks. Notably, in recent advancements, the transfer of knowledge from pretrained foundation models (PFMs) has proved beneficial for a myriad of downstream tasks (bommasani2021opportunities; zhou2023comprehensive).

Lifelong Learning

Lifelong Learning (parisi2019continual), aka Continual Learning, Sequential Learning, or Incremental Learning, studies the problem of learning from an infinite stream of data (de2021continual). The goal is to gradually extend the acquired knowledge and use it for future data, mitigating the occurrence of catastrophic forgetting or interference (mcclelland1995there). With only a small portion of the input data from one or few tasks available at once, lifelong learning particularly tends to preserve the knowledge learned from the previous input when learning on new data, i.e., addressing the stability-plasticity dilemma (grossberg2012studies). There are extensive applications of lifelong learning in solving tasks in ever-evolving systems, such as recommendations (chen2021towards; yao2021device) and anomaly detection (peng2021lime; doshi2022rethinking). Lifelong learning differs from MTL in the sense that its training object is a dynamic data stream, while MTL studies data from multiple tasks available at the beginning of the learning process.

Multi-View Learning (MVL)

MVL (xu2013survey; zhao2017multi; li2018survey) studies the problem of jointly learning from multi-view data samples, with the goal of optimizing the generalization performance of the jointly learned model (li2018survey). In real-world applications, multi-view data refers to objects described by multi-modal measurements, such as image+text, audio+video, and audio+articulation. Multi-Instance Multi-Label learning (MIML) (zhou2012multi) is a specific subtype of MVL, where an example is described by multiple instances and associated with multiple class labels. Because multi-view data is so common in practice, MVL has attracted much attention in both research and industry, and the respective solutions play essential roles in cross-media retrieval (zhen2019deep; huang2020forward), video analysis (wang2022cascade; zellers2021merlot), recommender systems (wei2022contrastive; chai2022knowledge), etc. MVL, including MIML, can be considered a specialized form of MTL, where the input contains data from multiple domains that are handled as distinct tasks, but the output is still in one label space.

In-Context Learning (ICL)

ICL (dong2022survey) has aroused interest as a novel learning paradigm for natural language processing (NLP) within Large Language Models (LLMs). ICL relies on templates in natural language that can demonstrate different tasks, such as solving mathematical reasoning problems (wei2022chain) and learning natural language inference (NLI) (liu2021natural). LLMs can then make predictions by taking this demonstration and its corresponding query pair as input. While both ICL and MTL involve leveraging shared knowledge or context to enhance task generalizability, ICL is specifically tailored to the target task within a narrower scope in real-world applications. However, recent large PFMs, like GPT-4 (openai2023gpt4), are inherently task-agnostic, accommodating various tasks owing to the diversity of demonstration templates encountered during their large-scale training stage.
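To make the "demonstration plus query" input format concrete, here is a minimal sketch of how such a prompt might be assembled. The `Input:`/`Output:` template and the function name `build_icl_prompt` are assumptions for illustration; actual templates vary across models and tasks, and no model call is made here.

```python
from typing import List, Tuple

def build_icl_prompt(demonstrations: List[Tuple[str, str]], query: str) -> str:
    """Concatenate (input, output) demonstrations and leave the query's output blank."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demonstrations]
    blocks.append(f"Input: {query}\nOutput:")   # the model is expected to continue here
    return "\n\n".join(blocks)

# Two hypothetical arithmetic demonstrations followed by a new query.
demos = [("2 + 3", "5"), ("7 + 1", "8")]
prompt = build_icl_prompt(demos, "4 + 4")
```

Swapping the demonstrations changes the task the model is asked to perform without changing any parameters, which is what distinguishes this setup from jointly training on several tasks.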

1.3. Motivation and Benefit

MTL can be motivated from the following five perspectives with different benefits: cognitive/social psychology, data augmentation, learning efficacy, real-world scenarios, and learning theory.

  • Psychologically, humans are inherently able to adapt flexibly to new problems and settings, as the human learning process can transfer knowledge from one experience to another (national2000people). MTL is therefore inspired by simulating this process to equip a model with the potential for multitasking. A similar form of knowledge transfer occurs among organizations (argote2000knowledge): it has been shown that organizations with more effective knowledge transfer are more productive and more likely to survive than those with less. These prior successes of transfer or mutualization in other areas encourage the joint learning of tasks in ML (caruana1997multitask).

  • In the pre-big-data era, real-world problems were usually represented by small but high-dimensional datasets (# samples < # features). This data bottleneck forced early methods to learn sparse-structured models, which typically yield a parsimonious solution to a problem with insufficient data. MTL emerged as a way to aggregate labeled data from different domains or tasks, enlarging the training dataset to guard against overfitting.

  • The pursuit of efficiency and effectiveness is also one of the motivations. MTL can aggregate data from different sources together, and the joint training process of multiple tasks can save both computation and storage resources. In addition, the potential of performance enhancement makes it popular in research communities. In brief, universal representations for any tasks can be learned from multi-source data, and benefit all tasks in terms of both the learning cost and performance.

  • Because the majority of real-world problems are naturally multimodal or multitasking, MTL is proposed to remedy the suboptimal solutions obtained by STL, which models only parts of the whole problem separately. For example, predicting the progression of Alzheimer’s Disease (AD) biomarkers for Mild Cognitive Impairment (MCI) risk and clinical diagnosis is simultaneously based on multimodal data such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) (jie2015manifold; kwak2018multi; chen2022machine). Autonomous driving, another example, also involves multiple subtasks to calculate the final prediction (yang2018end; chowdhuri2019multinet), including the recognition of surrounding objects, adjustments to the fastest route according to the traffic conditions, the balance between efficiency and safety, etc.

  • From the perspective of learning theory, bias-free learning has been proved to be impossible (mitchell1980need), which motivates MTL as a way to use the extra training signals from related tasks. Generally, MTL is one of the ways to achieve inductive transfer via multitasking assistance, which improves both learning speed and generalization. Specifically, during the combined training of multiple tasks, some tasks receive inductive bias from other related tasks, and these stronger inductive biases (compared with universal regularizers, e.g., $\ell_2$) enable knowledge transfer and yield better generalization on a fixed training dataset. In other words, task-related biases make a learner prefer hypotheses that can explain more than one task and prevent any specific task from overfitting.

1.4. Mechanism and Explanation

In this section, we explore three key mechanisms – regularization, inductive bias, and feature sharing – shedding light on how MTL operates to achieve enhanced performance across multiple tasks.

Regularization

In MTL, the total loss function is a combination of multiple loss terms with respect to each task. The related tasks play a role as regularizers, enhancing the generalizability across them. The hypothesis space of an MTL model is confined to a more limited scope as it tackles multiple tasks simultaneously. Consequently, this constraint on the hypothesis space reduces model complexity, mitigating the risk of overfitting.
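One common way to write this combination, given here only as an illustrative form, is a weighted sum of the per-task losses. The nonnegative weights $\lambda_t$ are an assumption introduced for this sketch; the per-task loss notation follows the caption of Fig. 4.

```latex
\mathcal{L}_{\text{total}}
  = \sum_{t=1}^{T} \lambda_t \,
    \mathcal{L}^{(t)}\!\big(\tilde{\boldsymbol{Y}}^{(t)}, \boldsymbol{Y}^{(t)}\big),
  \qquad \lambda_t \ge 0 .
```

From the viewpoint of a single task $s$, the remaining terms $\sum_{t \neq s} \lambda_t \mathcal{L}^{(t)}$ play the role of a data-dependent penalty on whatever parameters are shared, which is the sense in which related tasks act as regularizers.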

Inductive Bias

The training signals from co-training tasks act as mutual inductive biases due to their shared domain information. These biases facilitate cross-task knowledge transfer during training, guiding the model to favor task-related concepts rather than the tasks themselves. Consequently, this broadens the model’s horizons beyond a singular task, enhancing its generalization capabilities for unseen out-of-distribution (OOD) data.

Feature Sharing

MTL can enable feature sharing across related tasks. One approach involves selecting overlapping features and maximizing their utility across all tasks. This is referred to as “eavesdropping” (ruder2017overview), considering that some features may be unavailable for specific tasks but can be substituted by those learned from related tasks. Another way is to concatenate all the features extracted by different tasks; these features can then be used holistically across tasks via linear combination or nonlinear transformation.
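The concatenation route can be sketched in a few lines. In this hypothetical example, each task contributes its own extracted features, the features are concatenated into one shared vector, and each task reads the shared vector through its own linear combination; all values and names are made up for illustration.

```python
def linear_head(features, weights):
    """Task-specific linear combination over the shared feature vector."""
    return sum(f * w for f, w in zip(features, weights))

# Features extracted by two tasks (hypothetical values).
feats_task_a = [0.5, 1.0]
feats_task_b = [2.0]

# Concatenate so that every task can use every feature.
shared = feats_task_a + feats_task_b

# Each head has one weight per shared feature; a nonzero weight on another
# task's feature means the head borrows information from that task.
head_a = [1.0, 0.0, 0.5]   # task A also uses task B's feature
head_b = [0.0, 1.0, 1.0]   # task B also uses one of task A's features

pred_a = linear_head(shared, head_a)   # 0.5*1.0 + 1.0*0.0 + 2.0*0.5
pred_b = linear_head(shared, head_b)   # 0.5*0.0 + 1.0*1.0 + 2.0*1.0
```

Replacing `linear_head` with a nonlinear function of `shared` gives the nonlinear-transformation variant mentioned in the text.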

Overall, MTL can be an efficient and effective way to boost the performance of the ML model on multiple tasks by regularization, inductive transfer, and feature sharing.

1.5. Contributions and Highlights

Existing Surveys. ruder2017overview is a pioneering survey in MTL, offering a broad overview of MTL and focusing on advances in deep neural networks from 2015 to 2017. thung2018brief reviews MTL methods from a taxonomy perspective of input-output variants, mainly concentrating on traditional MTL prior to 2016. These two reviews complement each other. vafaeikia2020brief is an incomplete survey that briefly reviews recent deep MTL approaches, particularly focusing on the selection of auxiliary tasks for enhanced learning performance. crawshaw2020multi presents the well-established and advanced MTL methods before 2020 from the perspective of applications. vandenhende2021multi provides a comprehensive review of deep MTL in dense prediction tasks, which generate pixel-level predictions such as in semantic segmentation and monocular depth estimation. zhang2021survey gives a comprehensive overview of MTL models from the taxonomy of feature-based and parameter-based approaches, but with limited inclusion of deep learning (DL) methods. Notably, all these surveys overlook the development of MTL in the last three or four years, known as the era of large PFMs (bommasani2021opportunities; zhou2023comprehensive), exemplified by the GPT-series models (radford2018improving; radford2019language; brown2020language; openai2023gpt4).

Roadmap. This survey adopts a well-organized structure, distinguishing it from its predecessors, to demonstrate the evolutionary journey of MTL from traditional methods to DL and the innovative paradigm shift introduced by PFMs, as shown in Fig. 1. In § 2.1, we provide a comprehensive summary of traditional MTL techniques, including feature selection, feature transformation, decomposition, low-rank factorization, priori sharing, and task clustering. Moving forward, § 2.2 is devoted to exploring the critical dimensions of deep MTL methodologies, encompassing feature fusion, cascading, knowledge distillation, cross-task attention, scalarization, multi-objective optimization (MOO), adversarial training, Mixture-of-Experts (MoE), graph-based methods, and NAS. The recent advancements in PFMs are introduced in § 2.3, categorized based on task-generalizable fine-tuning, task promptable engineering, as well as task-agnostic unification. Additionally, we provide a concise overview of the miscellaneous aspects of MTL in § 3. § 4 provides valuable resources and tools to enhance the engagement of researchers and practitioners with MTL. Our discussions and future directions are presented in § 5, followed by our conclusion in § 6. The goal of this review is threefold: 1) to provide a comprehensive understanding of MTL for newcomers; 2) to function as a toolbox or handbook for engineering practitioners; and 3) to inspire experts by providing insights into the future directions and potentials of MTL.

2. MTL Models

Figure 4. Comparison of the general frameworks of STL and MTL: (a) Single-Task Learning (STL) and (b) Multi-Task Learning (MTL). (a) In STL, the learning function $f$ is trained on a single dataset $(\boldsymbol{X},\boldsymbol{Y})$, where $\boldsymbol{X}$ represents the input data and $\boldsymbol{Y}$ represents the corresponding labels. The function $f$ is parametrized by $\boldsymbol{W}$ and is trained to minimize a predefined loss function $\mathcal{L}(\tilde{\boldsymbol{Y}},\boldsymbol{Y})$, where $\tilde{\boldsymbol{Y}}$ is the predicted value. Once $f$ is trained, it can be used to generalize to unseen data. (b) In MTL, the learning pipeline is similar to STL, but instead of training on a single dataset, multiple datasets are combined for different tasks. The multiple tasks are learned jointly by optimizing multiple loss functions $\mathcal{L}^{(1)}(\tilde{\boldsymbol{Y}}^{(1)},\boldsymbol{Y}^{(1)}),\ldots,\mathcal{L}^{(T)}(\tilde{\boldsymbol{Y}}^{(T)},\boldsymbol{Y}^{(T)})$ simultaneously. It should be noted that although multiple tasks are learned jointly, the generalization of each task can still be performed independently.
Formalization

In machine learning, no matter the problem (discriminative, generative, adversarial, etc.), we hope to learn a predictive model by minimizing the regularized empirical loss as

(1) $\min\limits_{\boldsymbol{W}}\mathcal{L}(f_{\boldsymbol{W}}(\boldsymbol{X}),\boldsymbol{Y})+\lambda\Omega(\boldsymbol{W}),$

where $(\boldsymbol{X},\boldsymbol{Y})$ are data pairs sampled from a single task, and $\boldsymbol{W}$ contains the weights of the learning model $f(\cdot)$. In general, $\mathcal{L}$ measures the distance between the predictions and the ground truth, and $\Omega$ adds constraints to the learning model, e.g., sparsity. The trade-off parameter $\lambda$ controls the balance between the loss and the penalty. Fig. 4(a) shows the detailed framework of STL. In comparison, as shown in Fig. 4(b), the optimization in MTL is conducted on multiple loss functions to achieve joint learning, and each task can maintain a specific loss function. Accordingly, MTL considers the following problem:

(2) $\min\limits_{\{\boldsymbol{W}^{(t)}\}_{t=1}^{T}}\sum_{t=1}^{T}\mathcal{L}^{(t)}\left(f_{\boldsymbol{W}^{(t)}}(\boldsymbol{X}^{(t)}),\boldsymbol{Y}^{(t)}\right)+\lambda\Omega\left(\boldsymbol{W}^{(1)},\cdots,\boldsymbol{W}^{(T)}\right),$

where $T$ denotes the number of tasks, and $f(\cdot)$ is the MTL model to be learned. In MTL, $f(\cdot)$ always encodes both task-specific and task-shared representations, and $\Omega(\cdot)$ builds task relatedness and reciprocity; both contribute to the effectiveness and efficiency of MTL.
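As a minimal sketch of how the objective in Eq. (2) is evaluated, the Python snippet below sums the per-task losses and adds the weighted penalty $\lambda\Omega$. The linear model, the MSE loss, the mean-deviation penalty standing in for $\Omega$, and all function names are illustrative assumptions; Eq. (2) itself does not fix any of them.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]
Dataset = Tuple[Matrix, Vector]  # (X^(t), Y^(t)) for one task


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Per-task loss L^(t): mean squared error (an assumed choice)."""
    return sum((p - y) ** 2 for p, y in zip(pred, target)) / len(target)


def linear_model(w: Vector, X: Matrix) -> Vector:
    """f_W(X): a linear predictor applied row by row (an assumed choice)."""
    return [sum(w_d * x_d for w_d, x_d in zip(w, row)) for row in X]


def mean_deviation_penalty(weights: List[Vector]) -> float:
    """Illustrative Omega: squared distance of each task's weights from the task mean."""
    T, D = len(weights), len(weights[0])
    mean = [sum(w[d] for w in weights) / T for d in range(D)]
    return sum((w[d] - mean[d]) ** 2 for w in weights for d in range(D))


def mtl_objective(
    weights: List[Vector],
    datasets: List[Dataset],
    lam: float,
    omega: Callable[[List[Vector]], float] = mean_deviation_penalty,
) -> float:
    """Eq. (2): sum of per-task losses plus lam * Omega(W^(1), ..., W^(T))."""
    data_term = sum(mse(linear_model(w, X), Y) for w, (X, Y) in zip(weights, datasets))
    return data_term + lam * omega(weights)


# Two toy tasks with two features each (hypothetical data).
datasets = [
    ([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0]),
    ([[1.0, 1.0], [2.0, 0.0]], [3.0, 2.0]),
]
shared = [[1.0, 2.0], [1.0, 2.0]]      # identical weights: the penalty vanishes
divergent = [[1.0, 2.0], [3.0, 2.0]]   # different weights: the penalty is active
print(mtl_objective(shared, datasets, lam=0.5))
print(mtl_objective(divergent, datasets, lam=0.5))
```

Setting `lam=0` removes the coupling between tasks, so the objective reduces to a sum of independent single-task losses in the sense of Eq. (1).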

Figure 5. The classification of MTL problems into three different input/output configurations: (a) single-input multi-output (SIMO), (b) multi-input single-output (MISO), and (c) multi-input multi-output (MIMO).
I/O Configurations

To accommodate data in Eq. (2), it is necessary to consider various input/output (I/O) configurations that may impose constraints on the MTL modeling process. For instance, tasks such as semantic segmentation and depth estimation can utilize the same input images, and such applications are typically developed using datasets where each image is annotated with dense prediction labels for both segmentation and depth. On the other hand, when dealing with a digit recognition problem involving multiple domains (e.g., handwritten digits and license plate digits), different inputs are mapped to the same output space. We refer to the former as a single-input multi-output (SIMO) configuration and the latter as a multi-input single-output (MISO) configuration. In MTL, the most prevalent scenario is the multi-input multi-output (MIMO) configuration, where each task maintains its own set of samples and the label spaces are heterogeneous, e.g., autonomous driving that involves pedestrian detection and traffic sign recognition. Let us denote the data input space and its corresponding label space for the $t$-th task $(t=1,\cdots,T)$ by $\mathcal{X}^{(t)}$ and $\mathcal{Y}^{(t)}$, respectively. We classify the MTL problems into three cases: SIMO, MISO, and MIMO. Fig. 5 illustrates these three configurations. It is worth noting that the I/O configurations do not significantly impact the taxonomy of methods in MTL. As indicated in Table 2, there are numerous shared practices of applying different methods to these I/O configurations, as well as to various data modalities and task types.
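The three configurations can be made concrete with a small Python sketch that labels a set of tasks by whether they share an input space and whether they share a label space. The function name, the string identifiers for the spaces, and the fallback label are hypothetical and serve only to illustrate the definitions above.

```python
from typing import List, Tuple

# Each task is described by (input_space, label_space), i.e. names for X^(t) and Y^(t).
TaskSpec = Tuple[str, str]


def io_configuration(tasks: List[TaskSpec]) -> str:
    """Return SIMO, MISO, or MIMO according to which spaces the tasks share."""
    input_spaces = {x for x, _ in tasks}
    label_spaces = {y for _, y in tasks}
    if len(input_spaces) == 1 and len(label_spaces) > 1:
        return "SIMO"  # one shared input, several outputs
    if len(input_spaces) > 1 and len(label_spaces) == 1:
        return "MISO"  # several inputs mapped to one output space
    if len(input_spaces) > 1 and len(label_spaces) > 1:
        return "MIMO"  # every task brings its own samples and labels
    return "single input and output space"  # not a multi-task configuration


# Segmentation and depth estimation on the same images.
print(io_configuration([("street_images", "segmentation"), ("street_images", "depth")]))
# Digit recognition across two domains.
print(io_configuration([("handwritten", "digit"), ("license_plate", "digit")]))
# Separate datasets with different label spaces.
print(io_configuration([("pedestrian_set", "boxes"), ("sign_set", "sign_class")]))
```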

Table 2. Summary of MTL methods discussed in § 2.
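| Category | MTL Strategy | Assumption |
| --- | --- | --- |
| Regularization | Feature Selection | 1 |
| Regularization | Decomposition | 1 |
| Regularization | Low-Rank Factorization | 1 |
| Relationship Learning | Priori Sharing | 1 |
| Relationship Learning | Task Clustering/Grouping | 1 |
| Relationship Learning | Group-Based Learning | 1 |
| Relationship Learning | Mixture-of-Experts | 1 |
| Feature Propagation | Feature Fusion | 2 |
| Feature Propagation | Cascading | 2 |
| Feature Propagation | Knowledge Distillation | 2 |
| Feature Propagation | Cross-Task Attention | 2 |
| Optimization | Scalarization | 3 |
| Optimization | Multi-Objective Optimization | 3 |
| Optimization | Adversarial Training | 3 |
| Optimization | Neural Architecture Search | 1 |
| Pre-training | Downstream Fine-tuning | 1 |
| Pre-training | Task Prompting | 1 |
| Pre-training | Multi-Modal Unification | 1 |

Each strategy is further marked against I/O configuration (SIMO, MISO, MIMO), data modality (Table, Image, Text, Graph), and task type (Regression, Classification, Dense Prediction): one marker indicates common practice in the research community, and another indicates not applicable due to technical constraints.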

Taxonomy

MTL saw significant advancement prior to the DL era (caruna1993multitask; caruana1997multitask; bakker2003task; ando2005framework; obozinski2006multi; zhang2006a). Initially, there was a strong focus on weight/parameter regularization, including sparse learning for cross-task feature selection, low-rank learning to uncover underlying factors, and decomposition methods to capture informative components. These approaches, while innovative in integrating intuitive variations of existing methods (e.g., the $\ell_{2,1}$ regularizer derived from the classic $\ell_{1}$ regularizer), still face limitations in practical applications due to idealistic assumptions and a lack of consideration for task relationships. The emergence of methods like task clustering, priori sharing, graph-based learning, and MoE marked a shift towards more effective task relationship modeling. With the transition to the DL era, the abundance of features learned from architectures like convolutional neural networks (CNNs) (fukushima1980neocognitron; lecun1998gradient), recurrent neural networks (RNNs) (werbos1988generalization; hochreiter1997long) and Transformers (vaswani2017attention; dosovitskiy2020image) spurred the exploration of feature propagation methods, such as feature fusion, cascading, knowledge distillation (KD), and cross-task attention, all crucial for leveraging multi-source features. Alternatively, optimization-based methods, including scalarization, MOO, adversarial training and NAS, focus on gradients to harmonize optimization directions across tasks. These methods, while not restricted by I/O configurations, are constrained by the pre-defined number of tasks and the use of heterogeneous architectures.
Pre-training techniques, which leverage TL, mark a significant advancement towards unified and versatile multitasking, breaking limitations related to data modalities, dimensions, task numbers, model architectures, etc. The main cost is the substantial computational resources needed to train a model large enough to accommodate multi-task distributions. The MTL models are accordingly organized into five categories: regularization, relationship learning, feature propagation, optimization, and pre-training. Each contains a series of topics arranged chronologically in § 2.1 (traditional ML era), § 2.2 (DL era), and § 2.3 (PFM era). All of these topics can be derived from the three assumptions below, which are intuitive and have also been extensively validated by empirical evidence:

Assumption 1 (Parameter Relatedness).

Under the same hypothesis space, models learned to perform related tasks can exhibit similarities.

Assumption 2 (Feature Richness).

Given the same level of experience, expanding the number of tasks to be learned can enhance the richness of features.

Assumption 3 (Optimization Consistency).

Learning multiple related tasks jointly in a single model can ensure consistency in optimization directions for each task.

We acknowledge that the presented taxonomy is not exhaustive, and certain methods may be classified differently when viewed from a different perspective. For example, Task Tree (TAT) (han2015learning), a clustering MTL method, establishes a task hierarchy by decomposing the parameter matrix into different component matrices for each tree layer, so it could also be viewed as a decomposition method; we discuss it within the context of clustering MTL (see § 2.1.6). We also acknowledge that some methods that may be of interest to readers may not be included in this survey due to similarities or oversight. We welcome paper recommendations and will update the survey on our project page accordingly (https://github.com/junfish/Awesome-Multitask-Learning). In Table 2, we summarize the assumptions, common practice, and technical constraints of these topics in terms of I/O configuration, data modality, and task type.

2.1. Traditional Era: Provable but Restrictive

Table 3. Summary of basic notations used in this paper.
| Notation | Description |
| --- | --- |
| $n, N \in \mathbb{R}$ | Scalars are denoted by plain lowercase or uppercase letters. |
| #object | The number of objects, e.g., #task denotes the number of tasks. |
| $\boldsymbol{x}$ or $\vec{\boldsymbol{x}} \in \mathbb{R}^{N}$ | A vector $\boldsymbol{x}$ with $N$ entries, denoted by bold lowercase letters. |
| $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ | A matrix $\boldsymbol{X}$ with size $M \times N$, denoted by bold uppercase letters. |
| $\boldsymbol{\mathcal{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ | A tensor $\boldsymbol{\mathcal{X}}$ with size $I_1 \times \cdots \times I_N$, denoted by bold calligraphic letters. |
| $\{\star^{(i)}\}_{i=1}^{N}$ | A set containing $\star^{(1)}, \cdots, \star^{(N)}$, where $\star$ could be anything, e.g., scalar, vector, data pair, learner, etc. |
| $x_n \in \mathbb{R}$ | The $n$-th entry of vector $\boldsymbol{x} \in \mathbb{R}^{N}$, $n \in \{1, 2, \cdots, N\}$. |
| $x_{m,n}$ or $[\boldsymbol{X}]_{m,n} \in \mathbb{R}$ | The $(m,n)$-th entry of matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$, $m \in \{1, 2, \cdots, M\}$, $n \in \{1, 2, \cdots, N\}$. |
| $\boldsymbol{X} \odot \boldsymbol{Y} \in \mathbb{R}^{M \times N}$ | Element-wise product of $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ and $\boldsymbol{Y} \in \mathbb{R}^{M \times N}$, i.e., the $(m,n)$-th entry of $\boldsymbol{X} \odot \boldsymbol{Y}$ is $x_{m,n} y_{m,n}$. |
| $\boldsymbol{x}^{n} \in \mathbb{R}^{M}$ | The $n$-th column vector of matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$, $n \in \{1, 2, \cdots, N\}$. |
| $\boldsymbol{x}_{m} \in \mathbb{R}^{N}$ | The $m$-th row vector of matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$, $m \in \{1, 2, \cdots, M\}$. |
| $\boldsymbol{I}_{N \times N} \in \mathbb{R}^{N \times N}$ | The identity matrix of size $N \times N$, which has ones on the diagonal and zeros elsewhere. |
| $\mathrm{tr}(\boldsymbol{X}) \in \mathbb{R}$ | The trace of a matrix $\boldsymbol{X} \in \mathbb{R}^{N \times N}$, defined as the sum of its $N$ components on the diagonal. |
| $\mathrm{col}(\boldsymbol{X}) \subseteq \mathbb{R}^{M}$ | The column space of a matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$, which consists of all linear combinations of its column vectors. |
| $\mathrm{rank}(\boldsymbol{X}) \in \mathbb{R}$ | The rank of matrix $\boldsymbol{X}$, defined as the maximum number of linearly independent column (or row) vectors of $\boldsymbol{X}$. |
| $\mathrm{vec}(\boldsymbol{X}) \in \mathbb{R}^{MN}$ | The vectorization of the matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ by row-by-row stacking. |
| $\boldsymbol{D}^{+} \in \mathbb{R}^{N \times M}$ | The pseudoinverse of a matrix $\boldsymbol{D} \in \mathbb{R}^{M \times N}$. |
| $\boldsymbol{O}^{N} \subset \mathbb{R}^{N \times N}$ | The set of $N \times N$ orthogonal matrices. |
| $\boldsymbol{X} \in \boldsymbol{O}^{N}$ | The column vectors $\boldsymbol{x}^{1}, \cdots, \boldsymbol{x}^{N}$ of matrix $\boldsymbol{X}$ are orthonormal. |
| $\boldsymbol{S}^{N} \subset \mathbb{R}^{N \times N}$ | The set of $N \times N$ real symmetric matrices. |
| $\boldsymbol{S}_{+}^{N} \subset \boldsymbol{S}^{N}$ | The subset of $\boldsymbol{S}^{N}$ that contains positive semidefinite matrices. |
| $\lVert\boldsymbol{w}\rVert_{1}$ | The $\ell_{1}$ norm of a vector, calculated as the sum of the absolute vector values. |
| $\lVert\boldsymbol{w}\rVert_{2}$ | The $\ell_{2}$ norm of a vector, calculated as the square root of the sum of the squared vector values. |
| $\lVert\boldsymbol{w}\rVert_{\infty}$ | The $\ell_{\infty}$ norm of a vector, calculated as the maximum of the absolute vector values. |
| $\lVert\boldsymbol{W}\rVert_{0}$ | The $\ell_{0}$ norm, i.e., the cardinality of a matrix, defined as the number of nonzero components. |
| $\lVert\boldsymbol{W}\rVert_{1}$ | The $\ell_{1}$ norm of a matrix, calculated as the maximum of the $\ell_{1}$ norms of the column vectors. |
| $\lVert\boldsymbol{W}\rVert_{2}$ | The $\ell_{2}$ norm of a matrix, calculated as its maximum singular value. |
| $\lVert\boldsymbol{W}\rVert_{F}$ | The Frobenius norm of a matrix, calculated as the square root of the sum of the squared matrix values. |
| $\{\sigma_{r}(\boldsymbol{W})\}_{r=1}^{R}$ | The set of singular values of matrix $\boldsymbol{W}$ in non-increasing order. |
| $\lVert\boldsymbol{W}\rVert_{*}$ | The trace norm of a matrix, defined as the sum of its singular values, i.e., $\sum_{r=1}^{R}\sigma_{r}(\boldsymbol{W})$. |
| $\lVert\boldsymbol{W}\rVert_{\infty}$ | The $\ell_{\infty}$ norm of a matrix, calculated as the maximum of the $\ell_{1}$ norms of the row vectors. |
| $\lVert\boldsymbol{W}\rVert_{p,q}$ | The $\ell_{p,q}$ norm of a matrix, defined as the $q$-norm of the vector whose components are the $p$-norms of $\boldsymbol{W}$'s row vectors. |
| $\lVert\boldsymbol{W}\rVert_{1,1}$ | The $\ell_{1,1}$ norm of a matrix, defined as the sum of the absolute matrix components. |
| $\lVert\boldsymbol{W}\rVert_{1,2}$ | The $\ell_{1,2}$ norm of a matrix, calculated as the $\ell_{2}$ norm of the vector whose components are the $\ell_{1}$ norms of the row vectors. |
| $\lVert\boldsymbol{W}\rVert_{2,1}$ | The $\ell_{2,1}$ norm of a matrix, calculated as the sum of the $\ell_{2}$ norms of the row vectors. |

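To make the matrix norms of Table 3 concrete, the following Python sketch evaluates several of them on a small weight matrix. It follows the row-wise convention stated in the table for the $\ell_{p,q}$ norm; the function names and the toy matrix are illustrative assumptions, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math
from typing import List

Matrix = List[List[float]]


def lpq_norm(W: Matrix, p: float, q: float) -> float:
    """l_{p,q} norm: the q-norm of the vector of row-wise p-norms."""
    row_norms = [sum(abs(v) ** p for v in row) ** (1.0 / p) for row in W]
    return sum(r ** q for r in row_norms) ** (1.0 / q)


def frobenius_norm(W: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in W for v in row))


def l0_norm(W: Matrix) -> int:
    """Number of nonzero entries."""
    return sum(1 for row in W for v in row if v != 0.0)


def l1_matrix_norm(W: Matrix) -> float:
    """Maximum l1 norm over the column vectors."""
    return max(sum(abs(row[j]) for row in W) for j in range(len(W[0])))


def linf_matrix_norm(W: Matrix) -> float:
    """Maximum l1 norm over the row vectors."""
    return max(sum(abs(v) for v in row) for row in W)


# Toy matrix with 3 rows and 2 columns; the second row is entirely zero.
W = [[3.0, 4.0], [0.0, 0.0], [-1.0, 0.0]]
print("l_{2,1}  :", lpq_norm(W, 2, 1))
print("l_{1,1}  :", lpq_norm(W, 1, 1))
print("l_{1,2}  :", lpq_norm(W, 1, 2))
print("Frobenius:", frobenius_norm(W))
print("l_0      :", l0_norm(W))
print("l_1      :", l1_matrix_norm(W))
print("l_inf    :", linf_matrix_norm(W))
```

By hand, the $\ell_{2,1}$ norm of this matrix is $5 + 0 + 1 = 6$: the all-zero row contributes nothing, so penalizing this norm favors matrices in which entire rows vanish. The $\ell_{2,2}$ norm coincides with the Frobenius norm, which offers a quick consistency check.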
To establish a unified formulation, we start the review of traditional methods by defining a common framework. The notations for subsequent discussions are summarized in Table 3. Building upon this, we initiate our discussion with multiple standard regression models, one for each task, as a paradigm. The weights of these homogeneous models can be arranged into one weight matrix, catalyzing a series of MTL studies through matrix regularization techniques in the traditional era. We denote by $\{(\boldsymbol{X}^{(t)},\boldsymbol{y}^{(t)})\}_{t=1}^{T}$ our dataset across $T$ tasks. For each task indexed by $t=1,2,\cdots,T$, we are given $N_t$ samples with $D$ features, i.e., $\boldsymbol{X}^{(t)}\in\mathbb{R}^{N_t\times D}$, and the corresponding response values $\boldsymbol{y}^{(t)}\in\mathbb{R}^{N_t}$.

The single-task setting of these multiple linear regression problems is

(3) $\boldsymbol{y}^{(t)}=\boldsymbol{X}^{(t)}\boldsymbol{w}^{(t)}+\epsilon^{(t)},\quad t=1,\cdots,T,$

where $\boldsymbol{w}^{(t)}\in\mathbb{R}^{D}$ for any $t\in\{1,\cdots,T\}$, $\epsilon^{(t)}\sim\mathcal{N}(0,\sigma_t^{2}\mathbb{I})$ is the error term independent of $\boldsymbol{X}^{(t)}$, and $\sigma_t$ is determined by the system state for the $t$-th task. Each model is separately learned from independent samples $\{({\boldsymbol{x}_{1}^{(t)}}^{\top},y_{1}^{(t)}),\cdots,({\boldsymbol{x}_{N_t}^{(t)}}^{\top},y_{N_t}^{(t)})\}$.

A trivial simplification of the above linear regressions is that all tasks maintain the same feature size $D$, which leads to the natural idea of stacking the weight vectors of these tasks: $\boldsymbol{W}=[\boldsymbol{w}^{(1)},\cdots,\boldsymbol{w}^{(T)}]\in\mathbb{R}^{D\times T}$, where the matrix-based regularizers come into play. To estimate $\boldsymbol{W}$, the MTL method minimizes the objective function:

(4) $\min\limits_{\boldsymbol{W}}\sum\limits_{t=1}^{T}\frac{1}{N_t}\mathcal{L}^{(t)}\left(\boldsymbol{X}^{(t)}\boldsymbol{w}^{t},\boldsymbol{y}^{(t)}\right)+\lambda\Omega(\boldsymbol{W}),$

where we consider the weights of multiple models, i.e., $\boldsymbol{W}$, as a union, and denote by $\boldsymbol{w}^{t}$ the $t$-th column of $\boldsymbol{W}$. Normally, an identical loss function, e.g., mean squared error (MSE), is applied to $\{\mathcal{L}^{(t)}\}_{t=1}^{T}$, which originates from the i.i.d. assumption of $\{\epsilon^{(t)}\}_{t=1}^{T}$. To capture the task relatedness stated in Assumption 1, namely that the models of related tasks are similar to each other, $\Omega$ is designed to take various regularization forms in traditional MTL. An overview of the regularization techniques used for MTL in the traditional ML era, which are discussed in the following subsections, is presented in Table 4.
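As one concrete instance of Eq. (4), the sketch below minimizes the mean-regularized objective listed in the first row of Table 4 (Regularized MTL) by plain gradient descent, with tasks simulated according to Eq. (3). The optimizer, step size, iteration count, toy data, and all function names are assumptions made for illustration; they are not taken from the cited work.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]
Task = Tuple[Matrix, Vector]  # (X^(t), y^(t))


def simulate_task(w_true: Vector, n: int, sigma: float, rng: random.Random) -> Task:
    """Draw n samples from y = X w + eps (Eq. (3)) with Gaussian features and noise."""
    X = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(n)]
    y = [sum(x * w for x, w in zip(row, w_true)) + rng.gauss(0.0, sigma) for row in X]
    return X, y


def fit_regularized_mtl(tasks: List[Task], lam1: float, lam2: float,
                        lr: float = 0.05, steps: int = 2000) -> List[Vector]:
    """Gradient descent on
    1/2 sum_t 1/N_t ||X w^t - y||^2 + lam1 sum_t ||w^t - mean||^2 + lam2 sum_t ||w^t||^2.
    Returns one weight vector per task (the columns of W)."""
    T, D = len(tasks), len(tasks[0][0][0])
    W = [[0.0] * D for _ in range(T)]
    for _ in range(steps):
        mean = [sum(w[d] for w in W) / T for d in range(D)]
        updated = []
        for (X, y), w in zip(tasks, W):
            residual = [sum(x * wd for x, wd in zip(row, w)) - yi for row, yi in zip(X, y)]
            grad = [
                sum(r * row[d] for r, row in zip(residual, X)) / len(y)  # data term
                + 2.0 * lam1 * (w[d] - mean[d])                          # pull to task mean
                + 2.0 * lam2 * w[d]                                      # ridge shrinkage
                for d in range(D)
            ]
            updated.append([wd - lr * g for wd, g in zip(w, grad)])
        W = updated
    return W


def spread(W: List[Vector]) -> float:
    """Squared distance between the weight vectors of the first two tasks."""
    return sum((a - b) ** 2 for a, b in zip(W[0], W[1]))


rng = random.Random(0)
tasks = [simulate_task([1.0, -2.0], 30, 0.1, rng),
         simulate_task([1.2, -1.8], 30, 0.1, rng)]
W_stl = fit_regularized_mtl(tasks, lam1=0.0, lam2=0.0)  # decoupled, single-task fits
W_mtl = fit_regularized_mtl(tasks, lam1=1.0, lam2=0.0)  # coupled through the task mean
print("spread without coupling:", spread(W_stl))
print("spread with coupling   :", spread(W_mtl))
```

With `lam1 = lam2 = 0` the objective decouples into $T$ independent least-squares problems, which corresponds to the single-task baseline; increasing `lam1` pulls the task weight vectors towards their mean, which is how this regularizer encodes Assumption 1.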

Table 4. Summary of regularization technique used in MTL.
| Model Name | Origin | Year | Type | Matrix Regularizer | Vector Formalization |
| --- | --- | --- | --- | --- | --- |
| Regularized MTL | KDD | evgeniou2004regularized | Group regularization | Frobenius norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_t}\lVert\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\rVert_{2}^{2}+\lambda_{1}\sum_{t=1}^{T}\lVert\boldsymbol{w}^{t}-\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{w}^{t}\rVert_{2}^{2}+\lambda_{2}\sum_{t=1}^{T}\lVert\boldsymbol{w}^{t}\rVert_{2}^{2}$ |
Learning Multiple Tasks with Kernel Methods | JMLR | evgeniou2005learning | Priori Sharing | Adaptive penalty | $\min\limits_{\boldsymbol{V},\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{t=1}^{T}{\boldsymbol{w}^{t}}^{\top}\boldsymbol{V}^{+}\boldsymbol{w}^{t}$, s.t. $\boldsymbol{V}\in\boldsymbol{S}_{+}^{D}$, $\boldsymbol{V}\in\boldsymbol{S}^{D}$
Alternating structure optimization | JMLR | ando2005framework | Decomposition | Frobenius norm | $\min\limits_{\{\boldsymbol{W},\boldsymbol{V}\},\Theta}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}(\boldsymbol{w}^{t}+\Theta^{\top}\boldsymbol{v}^{t})-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}$, s.t. $\Theta\Theta^{\top}=\boldsymbol{I}_{h\times h}$
Multi-task feature selection | Tech. Rep.$^{1}$ | obozinski2006multi | Group-sparse learning | $\ell_{2,1}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}$
Multi-task Lasso | Thesis$^{2}$ | zhang2006a | Group-sparse learning | $\ell_{\infty,1}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{\infty}$
Multi-task feature learning | NeurIPS | argyriou2006multi | Group-sparse learning, feature learning | $\ell_{2,1}$ norm | $\min\limits_{\boldsymbol{U},\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|(\boldsymbol{X}^{(t)}\boldsymbol{U})\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\big(\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}\big)^{2}$, s.t. $\boldsymbol{U}\in\boldsymbol{O}^{D}$
Convex multi-task feature learning | Mach. Learn. | argyriou2008convex | Feature learning | Adaptive penalty | $\min\limits_{\boldsymbol{V},\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{t=1}^{T}{\boldsymbol{w}^{t}}^{\top}\boldsymbol{V}^{+}\boldsymbol{w}^{t}$, s.t. $\boldsymbol{V}\in\boldsymbol{S}_{+}^{D}$, $\text{tr}(\boldsymbol{V})\leq 1$, $\text{col}(\boldsymbol{W})\subseteq\text{col}(\boldsymbol{V})$
Low rank MTL | ICML | ji2009accelerated | Low-rank learning | Trace norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\|\boldsymbol{W}\|_{*}$
Convex ASO | ICML | chen2009convex | | | $\min\limits_{\boldsymbol{U},\Theta}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{u}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\eta(1-\eta)\,\text{tr}\big(\boldsymbol{U}^{\top}(\eta\boldsymbol{I}+\Theta^{\top}\Theta)^{-1}\boldsymbol{U}\big)$, s.t. $\Theta\Theta^{\top}=\boldsymbol{I}_{h\times h}$
Dirty block-sparse model | NeurIPS | jalali2010dirty | Group-sparse learning, decomposition | $\ell_{\infty,1}$ norm + $\ell_{1,1}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}(\boldsymbol{s}^{t}+\boldsymbol{b}^{t})-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{s}_{d}\|_{1}+\lambda_{2}\sum_{d=1}^{D}\|\boldsymbol{b}_{d}\|_{\infty}$, s.t. $\boldsymbol{W}=\boldsymbol{S}+\boldsymbol{B}$
Sparse multi-task Lasso | NeurIPS | lee2010adaptive | Group-sparse learning | Weighted $\ell_{2,1}$ norm + weighted $\ell_{1,1}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\sum_{d=1}^{D}\rho_{d}\|\boldsymbol{w}_{d}\|_{2}+\lambda_{2}\sum_{d=1}^{D}\theta_{d}\|\boldsymbol{w}_{d}\|_{1}$
Adaptive multi-task Lasso | NeurIPS | lee2010adaptive | Group-sparse learning + adaptive penalty | Weighted $\ell_{2,1}$ norm + weighted $\ell_{1,1}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\sum_{d=1}^{D}\rho_{d}\|\boldsymbol{w}_{d}\|_{2}+\lambda_{2}\sum_{d=1}^{D}\theta_{d}\|\boldsymbol{w}_{d}\|_{1}+\log Z(\boldsymbol{\rho},\boldsymbol{\theta})$, with $P(\boldsymbol{W}\mid\boldsymbol{\rho},\boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\rho},\boldsymbol{\theta})}\prod_{d=1}^{D}\prod_{t=1}^{T}\exp(-\theta_{d}\lvert w_{d,t}\rvert)\times\prod_{d=1}^{D}\exp(-\rho_{d}\|\boldsymbol{w}_{d}\|_{2})$
Large margin multi-task metric learning | NeurIPS | parameswaran2010large | Priori Sharing | Frobenius norm | $\min\limits_{\mathbf{M}_{0},\ldots,\mathbf{M}_{T}}\gamma_{0}\|\mathbf{M}_{0}-\mathbf{I}\|_{F}^{2}+\sum\nolimits_{t=1}^{T}\Big[\gamma_{t}\|\mathbf{M}_{t}\|_{F}^{2}+\sum\nolimits_{(i,j)\in J_{t},\,j\neq i}d_{t}^{2}(\mathbf{x}_{i},\mathbf{x}_{j})+\sum\nolimits_{(i,j,k)\in S_{t}}\xi_{ijk}\Big]$, s.t. $\forall t,\forall(i,j,k)\in S_{t}\colon\ d_{t}^{2}(\mathbf{x}_{i},\mathbf{x}_{k})-d_{t}^{2}(\mathbf{x}_{i},\mathbf{x}_{j})\geq 1-\xi_{ijk};\ \xi_{ijk}\geq 0;\ \mathbf{M}_{0},\mathbf{M}_{1},\ldots,\mathbf{M}_{T}\geq 0$
Hierarchical multitask structured output learning | NeurIPS | gornitz2011hierarchical | Priori Sharing | Frobenius norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\frac{1}{2}\sum_{t=1}^{T}\|\boldsymbol{w}\|_{2}^{2}-\lambda\boldsymbol{w}^{T}\boldsymbol{w}_{p}$, where $p$ is the parent node.
Robust MTL | KDD | chen2011integrating | Decomposition, group-sparse learning, low-rank learning | Trace norm + $\ell_{2,1}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\|\boldsymbol{X}^{(t)}(\boldsymbol{l}^{t}+\boldsymbol{s}^{t})-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\|\boldsymbol{L}\|_{*}+\lambda_{2}\sum_{t=1}^{T}\|\boldsymbol{s}_{t}\|_{2}$, s.t. $\boldsymbol{W}=\boldsymbol{L}+\boldsymbol{S}$
Temporal group Lasso | KDD | zhou2011multi | Group-sparse learning | Frobenius norm + $\ell_{2,1}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}+\lambda_{2}\sum_{t=1}^{T-1}\|\boldsymbol{w}^{t}-\boldsymbol{w}^{t+1}\|_{2}^{2}+\lambda_{3}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}$
Clustered MTL | NeurIPS | zhou2011clustered | Task clustering | Clustering penalty + $\ell_{2,2}$ norm | $\min\limits_{\boldsymbol{W},\boldsymbol{F}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\big(\text{tr}(\boldsymbol{W}^{\top}\boldsymbol{W})-\text{tr}(\boldsymbol{F}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{F})\big)+\lambda_{2}\sum_{t=1}^{T}\|\boldsymbol{w}^{t}\|_{2}^{2}$, s.t. $\boldsymbol{F}_{t,j}=1/\sqrt{n_{j}}$ if $t\in\mathcal{C}_{j}$ and $0$ otherwise, $t=1,\cdots,T$, where $n_{j}$ is the number of tasks in the $j$-th cluster $\mathcal{C}_{j}$.
Sparse and low rank MTL | TKDD | chen2012learning | Decomposition, sparse learning, low-rank learning | $\ell_{1,1}$ norm + trace norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{p}_{d}\|_{1}$, s.t. $\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q}$, $\|\boldsymbol{Q}\|_{*}\leq\tau$
Convex fused sparse group Lasso | KDD | zhou2012modeling | Group-sparse learning | $\ell_{1,1}$ norm + $\ell_{2,1}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{1}+\lambda_{2}\sum_{t=1}^{T-1}\|\boldsymbol{w}^{t}-\boldsymbol{w}^{t+1}\|_{1}+\lambda_{3}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}$
Adaptive multi-task elastic-net | SDM | chen2012adaptive | Group-sparse learning | $\ell_{2,1}$ norm + Frobenius norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}+\lambda_{2}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}$
Multi-level Lasso | ICML | lozano2012multi | Decomposition, sparse learning | $\ell_{1,1}$ norm + adaptive penalty | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\sum_{d=1}^{D}\theta_{d}+\lambda_{2}\sum_{d=1}^{D}\|\boldsymbol{\gamma}_{d}\|_{1}$, s.t. $\boldsymbol{W}=\vec{\boldsymbol{\theta}}\boldsymbol{\Lambda}\boldsymbol{\Gamma}$, $\vec{\boldsymbol{\theta}}\geq\boldsymbol{0}$
Robust multi-task feature learning | KDD | gong2012robust | Decomposition, group-sparse learning | $\ell_{2,1}$ norm + $\ell_{1,2}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{p}_{d}\|_{2}+\lambda_{2}\sqrt{\sum_{d=1}^{D}\|\boldsymbol{q}_{d}\|_{1}^{2}}$, s.t. $\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q}$
Multi-stage multi-task feature learning | NeurIPS | gong2012multi | Sparse learning | Capped $\ell_{1}$ norm (zhang2010analysis) | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{d=1}^{R}\min\{\|\boldsymbol{w}_{d}\|_{1},\tau\}$
Convex formulation for MTL | IJCAI | zhang2012convex | Priori sharing | Clustering penalty | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\frac{\lambda_{1}}{2}\text{tr}(\boldsymbol{W}\boldsymbol{W}^{\top})+\frac{\lambda_{2}}{2}\text{tr}(\boldsymbol{W}\boldsymbol{\Omega}^{-1}\boldsymbol{W}^{\top})$, s.t. $\boldsymbol{\Omega}\in\boldsymbol{S}_{+}^{D}$, $\text{tr}(\boldsymbol{\Omega})=1$
Multi-linear multi-task learning | ICML | romera2013multilinear | Low-rank learning | Overlapped tensor trace norm | $\min\limits_{\boldsymbol{\mathcal{W}}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{k=1}^{N}\|\boldsymbol{W}_{(k)}\|_{*}$, where $\boldsymbol{W}_{(k)}$ is the mode-$k$ unfolding of the tensor $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times I_{2}\times\cdots\times I_{N}}$.
Regularization approach to learn MTL | TKDD | zhang2014regularization | Priori sharing | Clustering penalty + $\ell_{2,2}$ norm | $\min\limits_{\boldsymbol{V},\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\frac{\lambda}{2}\sum_{t=1}^{T}\|\boldsymbol{w}^{t}\|_{2}^{2}+\text{tr}(\boldsymbol{W}\boldsymbol{\Omega}^{-1}\boldsymbol{W}^{\top})+d\ln\boldsymbol{\Omega}$, s.t. $\boldsymbol{\Omega}\in\boldsymbol{S}_{+}^{D}$
Multi-linear multi-task learning | NeurIPS | wimalawarne2014multitask | Low-rank learning | Scaled latent tensor trace norm | $\min\limits_{\boldsymbol{\mathcal{W}}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\inf_{\boldsymbol{\mathcal{W}}^{(1)}+\cdots+\boldsymbol{\mathcal{W}}^{(N)}=\boldsymbol{\mathcal{W}}}\lambda\sum_{k=1}^{N}I_{k}^{-1/2}\|\boldsymbol{W}_{(k)}^{(k)}\|_{*}$, where $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times I_{2}\times\cdots\times I_{N}}$ is a tensor.
Task Tree model | KDD | han2015learning | Task clustering | $\ell_{2,2}$ norm | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\sum_{h=1}^{H}\boldsymbol{w}_{h}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\sum_{h=1}^{H}\lambda_{h}\sum_{i<j}^{T}\|\boldsymbol{w}_{h}^{i}-\boldsymbol{w}_{h}^{j}\|_{2}^{2}$, s.t. $|\boldsymbol{w}_{h-1}^{i}-\boldsymbol{w}_{h-1}^{j}|\succeq|\boldsymbol{w}_{h}^{i}-\boldsymbol{w}_{h}^{j}|$, $\forall h\geq 2$, $\forall i<j$
Reduced rank multi-stage MTL | AAAI | han2016multi | Low-rank learning | Capped trace norm (sun2013robust) | $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{r=1}^{R}\min\{\sigma_{r}(\boldsymbol{W}),\tau\}$
  • 1 This work is published as a technical report of the Department of Statistics, UC Berkeley.

  • 2 This work is published in Jian Zhang’s Ph.D. thesis, CMU Technical Report CMU-LTI-06-006, 2006.

2.1.1. Feature Selection

The high-dimensional scaling regime (negahban2008joint), in which the number of model weights (one per feature) is much larger than the number of observations, i.e., $D\gg N$, arises in many real-world problems and makes it costly and arduous to identify effective predictor variables. Sparse learning with an $\ell_{1}$ regularizer aims to identify a structure characterized by a small number of non-zero elements. This parsimonious solution retains and selects the most effective and efficient subset of features for the target task (tibshirani1996regression). In MTL, Assumption 1 underpins the development of all sparse learning models. Under the settings of sparse learning, this assumption posits that similar sparsity patterns in model parameters indicate relatedness between tasks. As a result, sparsity patterns implicitly represent task relatedness, highlighting a subset of common features derived from limited samples. Further benefits of employing sparsity in MTL are assessed and discussed in lounici2009taking. In this section, our discussion of feature selection in MTL covers both the block-wise ($\ell_{2,1}$) and element-wise ($\ell_{1,1}$) approaches. Each approach maintains both shared and task-specific features, optimizing performance across all tasks. In the block-wise approach, tasks differentiate their priorities by assigning distinct weights to the commonly selected features. Conversely, the element-wise approach allows tasks to express distinct preferences over predictors by selecting specific features in addition to the shared ones.
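
To make the difference between the two penalties concrete, the following minimal sketch computes the $\ell_{2,1}$ and $\ell_{1,1}$ norms of two small weight matrices. The matrices `W_shared` and `W_scattered` and the helper names are hypothetical and chosen only for illustration; rows index features and columns index tasks.

```python
import numpy as np

def l21_norm(W):
    """Sum over features (rows) of the l2 norm taken across tasks (columns)."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def l11_norm(W):
    """Sum of the absolute values of all entries."""
    return float(np.sum(np.abs(W)))

# Hypothetical weight matrices with D = 4 features (rows) and T = 2 tasks (columns).
# Both contain four non-zero entries of magnitude 1.
W_shared = np.array([[1.0, 1.0],
                     [1.0, 1.0],
                     [0.0, 0.0],
                     [0.0, 0.0]])     # both tasks rely on features 0 and 1
W_scattered = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 0.0],
                        [0.0, 1.0]])  # each task relies on its own features

print("l1,1:", l11_norm(W_shared), l11_norm(W_scattered))
print("l2,1:", l21_norm(W_shared), l21_norm(W_scattered))
```

In this toy case the element-wise penalty does not distinguish the two patterns, whereas the block-wise penalty is smaller when the non-zero entries are concentrated in shared rows, which is the behavior that favors a common feature subset.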

Block-Wise Sparsity

Multi-Task Feature Selection (obozinski2006multi) is the first method to address the problem of joint feature selection across a group of related tasks. This method extends the $\ell_{1}$ regularization for STL to the $\ell_{2,1}$ regularization for MTL. The assumption behind the $\ell_{2,1}$ regularization scheme is that multiple related tasks have a similar preference for a few common features, which encourages solutions that share a sparsity pattern. Therefore, $\ell_{2,1}$ imposes a sparse penalty on the $\ell_{2}$ norms of the $T$-dimensional weight vectors associated with each feature across tasks (i.e., the row vectors of the weight matrix $\boldsymbol{W}\in\mathbb{R}^{D\times T}$). This is formulated as follows:

(5) $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum\limits_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2},$

which selects features globally by encouraging several feature-wise weight vectors $\boldsymbol{w}_{d}$ to be $\vec{\boldsymbol{0}}$ across all tasks. The $\ell_{2}$ norm applied to each feature-wise weight vector $\boldsymbol{w}_{d}$ before the $\ell_{1}$ norm serves as a magnitude measurement and could be replaced by any other $\ell_{p}$ ($p\geq 1$) norm (obozinski2006multi). When the number of tasks is $T=1$, this penalty reduces to $\ell_{1}$ regularization, so it can be seen as a generalization of the latter. To solve problem (5), obozinski2006multi offer a block-coordinate descent optimization method that updates the block of weights associated with each feature. liu2012multi propose an accelerated algorithm by reformulating the problem as two equivalent smooth convex optimization problems.
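
As an illustration of how problem (5) can be minimized in practice, the sketch below uses plain proximal gradient descent with row-wise group soft-thresholding. This is a swapped-in, generic solver and not the block-coordinate descent of obozinski2006multi or the accelerated algorithm of liu2012multi. The function names, the synthetic data, and the value of `lam` are assumptions made only for this example.

```python
import numpy as np

def prox_l21(W, thresh):
    """Row-wise group soft-thresholding: proximal operator of thresh * ||W||_{2,1}."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(row_norms, 1e-12))
    return scale * W

def mtl_l21_proximal_gradient(Xs, ys, lam, n_iters=500):
    """Proximal gradient descent on Eq. (5).
    Xs[t] has shape (N_t, D) and ys[t] has shape (N_t,)."""
    D, T = Xs[0].shape[1], len(Xs)
    # Step size 1/L, where L bounds the Lipschitz constant of the smooth loss gradient.
    L = max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in Xs)
    W = np.zeros((D, T))
    for _ in range(n_iters):
        # Column t of the gradient: (1 / N_t) * X_t^T (X_t w^t - y^(t)).
        grad = np.column_stack([
            X.T @ (X @ W[:, t] - y) / X.shape[0]
            for t, (X, y) in enumerate(zip(Xs, ys))
        ])
        W = prox_l21(W - grad / L, lam / L)
    return W

# Synthetic example (assumed data): only the first 3 of D = 10 features are relevant,
# and they are relevant for all T = 3 tasks.
rng = np.random.default_rng(0)
D, T, N = 10, 3, 200
W_true = np.zeros((D, T))
W_true[:3] = [[1.0, -2.0, 1.5], [2.0, 1.0, -1.0], [-1.5, 1.0, 2.0]]
Xs = [rng.standard_normal((N, D)) for _ in range(T)]
ys = [X @ W_true[:, t] + 0.01 * rng.standard_normal(N) for t, X in enumerate(Xs)]

W_hat = mtl_l21_proximal_gradient(Xs, ys, lam=0.1)
print(np.round(np.linalg.norm(W_hat, axis=1), 3))  # per-feature row norms
```

Because the proximal step shrinks whole rows of $\boldsymbol{W}$ at once, a feature is either kept for all tasks (with task-specific weights) or removed for all of them, which mirrors the shared sparsity pattern described above.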

Multi-Task Lasso (zhang2006a) extends the efficient $\ell_{p,1}$ regularizers by imposing the $\ell_{\infty}$ norm on each feature-wise weight vector $\boldsymbol{w}_{d}$. Based on the assumption that the number of effective predictor features is much smaller than the total number of features, Multi-Task Lasso learns a sparser structure by

(6) $\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum\limits_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{\infty}.$

The $\ell_{\infty}$ norm takes the maximum absolute value of each feature-wise vector $\boldsymbol{w}_{d}$ across all tasks. This is appropriate if relevant features are not shared by every task, a situation that becomes more frequent as the number of tasks grows. zhang2006a proves that this $\ell_{\infty,1}$ problem can be solved by an efficient convex optimization technique. Furthermore, a full spectrum of $\ell_{p,1}$ regularization ($\ell_{1,1}$, especially) suitable for MTL is investigated and discussed. However, negahban2008joint prove that, compared with solving a separate Lasso problem for each task, the use of $\ell_{1,\infty}$ improves learning efficiency only if the overlap of feature entries across tasks is large enough ($>2/3$).
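
The following short sketch evaluates the $\ell_{p,1}$ family for $p\in\{1,2,\infty\}$ on two hypothetical single-feature weight matrices, to show how the choice of $p$ changes the cost of letting additional tasks use an already selected feature. The helper `lp1_norm` and the example matrices are illustrative assumptions.

```python
import numpy as np

def lp1_norm(W, p):
    """Sum over features (rows) of the l_p norm taken across tasks (columns)."""
    return float(np.sum(np.linalg.norm(W, ord=p, axis=1)))

# One feature (row) and T = 3 tasks (columns); hypothetical values.
W_one = np.array([[2.0, 0.0, 0.0]])  # the feature is used by a single task
W_all = np.array([[2.0, 2.0, 2.0]])  # the feature is used by all three tasks

for p in (1, 2, np.inf):
    print(p, lp1_norm(W_one, p), lp1_norm(W_all, p))

# With a single task (T = 1), every p gives the ordinary l1 norm of the weights.
w_single = np.array([[1.0], [-3.0], [0.0]])
print([lp1_norm(w_single, p) for p in (1, 2, np.inf)])
```

Under $p=\infty$ only the largest magnitude in each row is charged, so once one task uses a feature the other tasks can use it at no extra penalty up to that magnitude; under $p=1$ every task pays separately, and $p=2$ lies in between.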

Temporal Group Lasso (zhou2011multi) is an MTL formulation for predicting disease progression, which treats the $T$ time points of disease progression as related tasks. They first point out that the analytical solution $\boldsymbol{W}=(\boldsymbol{X}^{\top}\boldsymbol{X}+\lambda_{1}\boldsymbol{I})^{-1}\boldsymbol{X}^{\top}\boldsymbol{Y}$ to the ridge regression problem $\min_{\boldsymbol{W}}\|\boldsymbol{X}\boldsymbol{W}-\boldsymbol{Y}\|_{F}^{2}+\lambda_{1}\|\boldsymbol{W}\|_{F}^{2}$ is limited by its assumption of task independence, where $\boldsymbol{X}$ is shared by all tasks and $\boldsymbol{Y}=[\boldsymbol{y}^{(1)},\cdots,\boldsymbol{y}^{(T)}]$ denotes the progression of disease across $T$ tasks (time points). To capture the temporal smoothness between adjacent time points, Temporal Group Lasso adds a temporal smoothness term and a feature selection term, giving the formulation

$$\min\limits_{\boldsymbol{W}}\frac{1}{2}\|S\odot(\boldsymbol{X}\boldsymbol{W}-\boldsymbol{Y})\|^{2}_{F}+\lambda_{1}\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}+\lambda_{2}\sum\limits_{t=1}^{T-1}\|\boldsymbol{w}^{t}-\boldsymbol{w}^{t+1}\|_{2}^{2}+\lambda_{3}\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}, \tag{7}$$

where $S\in\mathbb{R}^{N\times T}$ is the indicator matrix for the incomplete data, i.e., for any $n\in\{1,\cdots,N\}$ and $t\in\{1,\cdots,T\}$, $s_{n,t}=0$ if the target value of sample $n$ at the $t$-th time point is missing and $s_{n,t}=1$ otherwise. This problem can be solved by the accelerated gradient method (AGM) (nesterov2013gradient) using SLEP (liu2009slep). However, to avoid the shrinkage of relevant features that would result in sub-optimal performance, zhou2011multi proposed a standard two-stage procedure to relax the $\ell_{1}$ regularization.
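As an illustration of how the mask $S$ and the three penalties in Eq. (7) interact, the sketch below evaluates the objective for a given $\boldsymbol{W}$. It is not the SLEP solver; the function name `tgl_objective`, the encoding of missing targets as `NaN`, and the toy data are assumptions made for this example.

```python
import numpy as np

def tgl_objective(X, Y, S, W, lam1, lam2, lam3):
    """Evaluate the Temporal Group Lasso objective of Eq. (7).

    X: (N, D) shared inputs, Y: (N, T) targets (NaN where missing),
    S: (N, T) indicator with 0 for missing targets, W: (D, T) weights.
    """
    # Replace NaN by 0 first; the mask S then removes those entries from the loss.
    residual = S * (X @ W - np.nan_to_num(Y))
    fit = 0.5 * np.sum(residual ** 2)
    ridge = lam1 * np.sum(W ** 2)                       # sum_d ||w_d||_2^2
    smooth = lam2 * np.sum(np.diff(W, axis=1) ** 2)     # sum_t ||w^t - w^{t+1}||_2^2
    group = lam3 * np.sum(np.linalg.norm(W, axis=1))    # sum_d ||w_d||_2
    return float(fit + ridge + smooth + group)

X = np.eye(2)
W = np.array([[1.0, 2.0],
              [0.0, 0.0]])
Y = np.array([[1.0, np.nan],   # second time point of sample 1 is missing
              [0.0, 0.0]])
S = np.array([[1.0, 0.0],
              [1.0, 1.0]])

value = tgl_objective(X, Y, S, W, lam1=1.0, lam2=1.0, lam3=1.0)
```

In this toy case the observed targets are fit exactly, so the value consists only of the three penalty terms; the missing entry contributes nothing to the loss.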

Adaptive Multi-Task Elastic-Net (chen2012adaptive) aims to address the problem of collinearity in multi-task feature selection. Inspired by the elastic-net (zou2005regularization), a natural idea is to add another quadratic penalty $\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}$ to the sparse multi-task constraint $\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}$, which forms the corresponding multi-task elastic-net problem as

$$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum\limits_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}+\lambda_{2}\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}, \tag{8}$$

where the traditional $\ell_{2,1}$ mixed norm applies the same amount of regularization to all features. As discussed below for the adaptive sparse multi-task Lasso (lee2010adaptive), it is promising to learn different regularization weights $\{\boldsymbol{w}_{d}\}_{d=1}^{D}$ for each feature. However, unlike the application of eQTL detection (lee2010adaptive), where features on single nucleotide polymorphisms (SNPs) make it easier to incorporate prior knowledge for each feature (see Eq. (10)), the priors scaling the importance of adaptive weights for each feature are often unavailable in real-world problems. chen2012adaptive proposes a three-stage algorithm to estimate the adaptive weights $\boldsymbol{w}_{d}$ using a data-driven method: (1) estimate the initial regression weights $\{\hat{\boldsymbol{w}}_{d}\}_{d=1}^{D}$ with a uniform weight for each feature; (2) construct adaptive scaling weights $\{\hat{\lambda}_{d}\}_{d=1}^{D}$, $\hat{\lambda}_{d}=(\|\hat{\boldsymbol{w}}_{d}\|_{2})^{-\gamma}$, according to the weights estimated in the first step, where $\gamma$ is a fixed constant; (3) compute the final estimated parameters via the multi-task elastic-net with the adaptive scaling weights, i.e., $\hat{\boldsymbol{W}}=\arg\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\hat{\lambda}_{d}\|\boldsymbol{w}_{d}\|_{2}+\lambda_{2}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}$.
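The three stages above can be sketched with a plain proximal gradient solver. This is a minimal illustration under stated assumptions, not the algorithm of chen2012adaptive: the solver choice, the function names, the small constant `eps` that guards against division by zero in stage 2, and the synthetic data are all introduced here for the example.

```python
import numpy as np

def group_soft_threshold(W, thresholds):
    """Row-wise shrinkage: proximal operator of sum_d thresholds[d] * ||w_d||_2."""
    norms = np.linalg.norm(W, axis=1)
    scale = np.maximum(0.0, 1.0 - thresholds / np.maximum(norms, 1e-12))
    return W * scale[:, None]

def multitask_elastic_net(Xs, ys, lam1, lam2, feature_weights, n_iter=500):
    """Proximal gradient for Eq. (8) with per-feature weights on the l2,1 term."""
    D, T = Xs[0].shape[1], len(Xs)
    W = np.zeros((D, T))
    # Lipschitz constant of the gradient of the smooth part (loss + ridge).
    lipschitz = max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in Xs) + 2.0 * lam2
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = np.column_stack([
            X.T @ (X @ W[:, t] - y) / X.shape[0]
            for t, (X, y) in enumerate(zip(Xs, ys))
        ])
        grad += 2.0 * lam2 * W
        W = group_soft_threshold(W - step * grad, step * lam1 * feature_weights)
    return W

def adaptive_multitask_elastic_net(Xs, ys, lam1, lam2, gamma=1.0, eps=1e-8):
    D = Xs[0].shape[1]
    # Stage 1: initial estimate with a uniform weight for each feature.
    W_init = multitask_elastic_net(Xs, ys, lam1, lam2, np.ones(D))
    # Stage 2: adaptive weights (||w_d||_2)^(-gamma); eps is an added safeguard.
    lam_hat = (np.linalg.norm(W_init, axis=1) + eps) ** (-gamma)
    # Stage 3: refit with the adaptive weights.
    return multitask_elastic_net(Xs, ys, lam1, lam2, lam_hat)

# Synthetic data: 3 tasks, 5 features, only the first 2 features are relevant.
rng = np.random.default_rng(0)
W_true = np.zeros((5, 3))
W_true[0] = [1.5, -2.0, 1.0]
W_true[1] = [2.0, 1.0, -1.5]
Xs = [rng.normal(size=(50, 5)) for _ in range(3)]
ys = [X @ W_true[:, t] + 0.1 * rng.normal(size=50) for t, X in enumerate(Xs)]

W_hat = adaptive_multitask_elastic_net(Xs, ys, lam1=0.1, lam2=0.01)
row_norms = np.linalg.norm(W_hat, axis=1)
```

Features with a large initial row norm receive a small $\hat{\lambda}_{d}$ and are shrunk less in the final fit, while features that were already near zero are penalized heavily.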

Element-Wise Sparsity

Sparse Multi-Task Lasso (lee2010adaptive) allows feature-specific penalty magnitudes by incorporating a set of priors with fixed scaling parameters. This method also generalizes the sparse group Lasso penalty (simon2013sparse) by using both the $\ell_{2,1}$ and $\ell_{1,1}$ norms to perform joint block-wise and element-wise feature selection. Specifically, sparse multi-task Lasso proposes

$$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum\limits_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum\limits_{d=1}^{D}\rho_{d}\|\boldsymbol{w}_{d}\|_{2}+\lambda_{2}\sum\limits_{d=1}^{D}\theta_{d}\|\boldsymbol{w}_{d}\|_{1}, \tag{9}$$

where $\boldsymbol{\rho}=[\rho_{1},\cdots,\rho_{D}]^{\top}$ and $\boldsymbol{\theta}=[\theta_{1},\cdots,\theta_{D}]^{\top}$ are the scaling weights for the $\ell_{2,1}$ and $\ell_{1,1}$ regularizers, respectively. This method has two advantages: (1) Unlike previous work by obozinski2006multi; zhang2006a, which considers the $\ell_{p,1}~(p>1)$ norm that learns block-wise sparsity well but overlooks element-wise sparsity within each feature group, sparse multi-task Lasso balances the $\ell_{2,1}$ and $\ell_{1,1}$ regularizers via $\lambda_{1}$ and $\lambda_{2}$ to achieve both simultaneously. (2) Unlike obozinski2006multi; zhang2006a, which treat every feature-wise weight vector ($\{\boldsymbol{w}_{d}\}_{d=1}^{D}$) equally, i.e., $\rho_{d}=\theta_{d}=1$ for $d\in\{1,\cdots,D\}$, the two scaling vectors in lee2010adaptive can be automatically learned from data. Furthermore, maurer2013sparse uses the $\ell_{1}$ regularizer on data preprocessed by a linear mapping function and provides bounds on the generalization error for both MTL and TL settings.
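The joint block-wise and element-wise effect of the penalty in Eq. (9) is easiest to see through its proximal operator. The sketch below relies on the standard result for the sparse group Lasso penalty, namely that the proximal map is element-wise soft-thresholding followed by row-wise group shrinkage; the function name and the toy numbers are assumptions for illustration and are not taken from lee2010adaptive.

```python
import numpy as np

def prox_sparse_multitask_lasso(W, rho, theta, lam1, lam2, step=1.0):
    """Proximal operator of step * (lam1 * sum_d rho_d ||w_d||_2
                                    + lam2 * sum_d theta_d ||w_d||_1)."""
    # Element-wise soft-thresholding (l1,1 part), with a per-feature threshold.
    t1 = (step * lam2 * theta)[:, None]
    Z = np.sign(W) * np.maximum(np.abs(W) - t1, 0.0)
    # Row-wise group shrinkage (l2,1 part), also with a per-feature threshold.
    t2 = step * lam1 * rho
    norms = np.linalg.norm(Z, axis=1)
    scale = np.maximum(0.0, 1.0 - t2 / np.maximum(norms, 1e-12))
    return Z * scale[:, None]

W = np.array([[3.0,  0.5],
              [0.2, -0.1],
              [0.0,  4.0]])
rho = np.ones(3)     # uniform scaling, i.e. rho_d = theta_d = 1
theta = np.ones(3)

W_prox = prox_sparse_multitask_lasso(W, rho, theta, lam1=1.0, lam2=1.0)
```

In the result, the second feature is removed for all tasks (block-wise sparsity), while the first feature is kept for task 1 only (element-wise sparsity). Increasing `rho[d]` or `theta[d]` for a single feature strengthens the corresponding effect for that feature alone.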

Figure 6. The Bayesian graph for adaptive sparse multi-task Lasso model.

Adaptive Sparse Multi-Task Lasso (lee2010adaptive) generalizes the formulation above. This method adaptively incorporates prior knowledge on SNPs (brookes1999essence) to learn the two scaling vectors $\boldsymbol{\rho}$ and $\boldsymbol{\theta}$, which are defined as mixtures of the features on the $d$-th SNP

$$\begin{aligned}&\rho_{d}=\sum\limits_{i}v_{i}f_{i}^{d}~\text{and}~\theta_{d}=\sum\limits_{i}\omega_{i}f_{i}^{d},\quad d=1,\cdots,D,\\&\text{s.t.}\quad\sum\limits_{i}v_{i}=\sum\limits_{i}\omega_{i}=1,\end{aligned} \tag{10}$$

where $f_{i}^{d}$ is the $i$-th feature of the $d$-th SNP. Here, the component $x_{n_{t},d}\in\{0,1,2\}$ of $\boldsymbol{X}^{(t)}\in\mathbb{R}^{N_{t}\times D}$ in Eq. (9) denotes the number of minor alleles at the $d$-th SNP of the $n_{t}$-th sample. lee2010adaptive uses a directed graphical model, shown in Fig. 6, as a Bayesian tool to find the maximum a posteriori (MAP) estimate of all the above learnable weights. The conditional probability of the weight matrix $\boldsymbol{W}$ given $\boldsymbol{\rho}$ and $\boldsymbol{\theta}$ is

$$P(\boldsymbol{W}|\boldsymbol{\rho},\boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\rho},\boldsymbol{\theta})}\prod\limits_{d=1}^{D}\prod\limits_{t=1}^{T}\exp(-\theta_{d}\lvert w_{d,t}\rvert)\times\prod\limits_{d=1}^{D}\exp(-\rho_{d}\|\boldsymbol{w}_{d}\|_{2}), \tag{11}$$

where the normalization factor $Z(\boldsymbol{\rho},\boldsymbol{\theta})$ is upper-bounded using the high-dimensional multivariate Laplace distribution (gomez1998multivariate). Accordingly, lee2010adaptive proposes an alternating minimization approach that iteratively optimizes one of $(\boldsymbol{v},\boldsymbol{\omega})$ and $\boldsymbol{W}$ while fixing the other until convergence.
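To connect Eq. (10) and Eq. (11), the sketch below builds $\boldsymbol{\rho}$ and $\boldsymbol{\theta}$ from a matrix of SNP features and evaluates the unnormalized log of the prior in Eq. (11), i.e., without the $\log Z(\boldsymbol{\rho},\boldsymbol{\theta})$ term. The array `F` (with `F[d, i]` standing for $f_{i}^{d}$), the function names, and all numbers are hypothetical; the alternating minimization itself is not implemented here.

```python
import numpy as np

def scaling_vectors(F, v, omega):
    """Eq. (10): rho_d = sum_i v_i f_i^d and theta_d = sum_i omega_i f_i^d."""
    # Mixture weights must each sum to one, as required by the constraint.
    assert np.isclose(v.sum(), 1.0) and np.isclose(omega.sum(), 1.0)
    return F @ v, F @ omega

def unnormalized_log_prior(W, rho, theta):
    """log of Eq. (11) up to the additive constant -log Z(rho, theta)."""
    l1_term = np.sum(theta[:, None] * np.abs(W))        # sum_d sum_t theta_d |w_{d,t}|
    l2_term = np.sum(rho * np.linalg.norm(W, axis=1))   # sum_d rho_d ||w_d||_2
    return float(-(l1_term + l2_term))

# 3 SNPs (rows), 2 prior features per SNP (columns); values are made up.
F = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 2.0]])
v = np.array([0.5, 0.5])
omega = np.array([1.0, 0.0])

rho, theta = scaling_vectors(F, v, omega)

W = np.array([[3.0, -4.0],
              [0.0,  0.0],
              [1.0,  0.0]])
log_p = unnormalized_log_prior(W, rho, theta)
```

Maximizing this quantity together with the data likelihood over $\boldsymbol{W}$ recovers the two penalty terms of Eq. (9), which is the sense in which the regularizers correspond to a MAP estimate.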

Convex Fused Sparse Group Lasso (cFSGL) (zhou2012modeling) extends the temporal group Lasso (zhou2011multi) with a formulation that additionally allows element-wise feature selection. cFSGL encourages sparsity both for joint feature selection across tasks and for task-specific feature selection within a task. The formulation can be written as

$$\min\limits_{\boldsymbol{W}}\frac{1}{2}\|S\odot(\boldsymbol{X}\boldsymbol{W}-\boldsymbol{Y})\|^{2}_{F}+\lambda_{1}\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{1}+\lambda_{2}\sum\limits_{t=1}^{T-1}\|\boldsymbol{w}^{t}-\boldsymbol{w}^{t+1}\|_{1}+\lambda_{3}\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}, \tag{12}$$

where $\sum_{t=1}^{T-1}\|\boldsymbol{w}^{t}-\boldsymbol{w}^{t+1}\|_{1}$ is the fused Lasso penalty, and the combination of $\ell_{1,1}$ and $\ell_{2,1}$ is also known as the sparse group Lasso penalty (simon2013sparse). This problem with three non-smooth regularization terms can be solved by AGM via computing the decoupled proximal operator.
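The three non-smooth terms of Eq. (12) can be evaluated separately, which makes it easy to see what each one measures. The following sketch only computes the penalty values; it does not implement the decoupled proximal operator, and the function name and toy matrix are assumptions for illustration.

```python
import numpy as np

def cfsgl_penalties(W, lam1, lam2, lam3):
    """Return the three regularization terms of Eq. (12) for W of shape (D, T)."""
    return {
        # Element-wise sparsity: sum_d ||w_d||_1.
        "lasso": lam1 * float(np.sum(np.abs(W))),
        # Fused Lasso: sum_t ||w^t - w^{t+1}||_1, differences between adjacent time points.
        "fused": lam2 * float(np.sum(np.abs(np.diff(W, axis=1)))),
        # Group Lasso: sum_d ||w_d||_2, joint selection of a feature across time points.
        "group": lam3 * float(np.sum(np.linalg.norm(W, axis=1))),
    }

# 3 features (rows) x 3 time points (columns).
W = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 0.0],
              [2.0, 2.0, 1.0]])

terms = cfsgl_penalties(W, lam1=0.1, lam2=0.5, lam3=1.0)
total_penalty = sum(terms.values())
```

Compared with the squared differences in Eq. (7), the $\ell_{1}$ differences in the fused term charge nothing for a feature whose weight stays constant between two adjacent time points, as for the first two columns of the third row here.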

Multi-Stage Multi-Task Feature Learning (gong2012multi) represents a pioneering approach to address the sub-optimal solutions observed in prior convex sparse regularization problems. This sub-optimality can be attributed to the challenges in approximating $\ell_{0}$ regularization. In response to this limitation, the method introduces a non-convex formulation utilizing capped $\ell_{1,1}$ regularization for MTL:

$$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum\limits_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum\limits_{d=1}^{D}\min\{\|\boldsymbol{w}_{d}\|_{1},\tau\}, \tag{13}$$

where $\tau$ is a threshold that caps the $\ell_{1}$ norms of the weight vectors, i.e., $\{\|\boldsymbol{w}_{d}\|_{1}\}_{d=1}^{D}$, corresponding to each feature, and the term $\sum_{d=1}^{D}\min\{\|\boldsymbol{w}_{d}\|_{1},\tau\}$ is a natural generalization of the capped $\ell_{1}$ norm in zhang2010analysis; zhang2013multi. To solve the non-convex problem (13), gong2012multi proposed an efficient algorithm and investigated the estimation error bound of the resulting estimator.
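The capped penalty in Eq. (13) and one plausible reweighting rule for a multi-stage scheme can be sketched as follows. The rule shown here is an assumption for illustration: because $\min\{x,\tau\}$ is concave in $x=\|\boldsymbol{w}_{d}\|_{1}$, it can be majorized at the current estimate by a weighted $\ell_{1}$ term whose weight is $\lambda$ when the row norm is below $\tau$ and $0$ otherwise. The text above does not spell out the algorithm of gong2012multi, so its details may differ; the function names are hypothetical.

```python
import numpy as np

def capped_l1_penalty(W, lam, tau):
    """lam * sum_d min(||w_d||_1, tau), the regularizer of Eq. (13)."""
    row_l1 = np.abs(W).sum(axis=1)
    return float(lam * np.sum(np.minimum(row_l1, tau)))

def next_stage_weights(W, lam, tau):
    """Per-feature l1 weights for the next stage (assumed majorization rule).

    Features whose l1 norm already reached the cap tau are no longer penalized,
    so they are not shrunk further in the next weighted Lasso problem.
    """
    row_l1 = np.abs(W).sum(axis=1)
    return lam * (row_l1 < tau).astype(float)

W = np.array([[3.0, -4.0],
              [0.1,  0.0],
              [0.0,  0.0]])

penalty = capped_l1_penalty(W, lam=0.5, tau=1.0)
weights = next_stage_weights(W, lam=0.5, tau=1.0)
```

The first feature is clearly active, so it contributes only the constant $\lambda\tau$ to the penalty and receives zero weight in the next stage, which avoids the shrinkage bias that a convex $\ell_{1}$ penalty would keep applying to it.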

Remarks (i) Feature selection can highlight task relatedness, especially in scenarios with limited data availability ($\#\text{feature}>\#\text{data}$). (ii) The $\ell_{1}$-series regularization easily facilitates feature selection, offering broad generalizability across various parametric models in MTL. (iii) In MTL contexts with plenty of training resources, feature selection might compromise performance; however, it enhances interpretability through the selected features. (iv) In situations with limited data, certain feature selection techniques may become vulnerable to minor data variations, which can potentially impact the stability of the learning process.
(a) Feedforward Neural Networks.
(b) Recurrent Neural Networks.
Figure 7. Hard-parameter sharing in FNNs and RNNs. (a) The earliest version of hard parameter sharing. The connections between inputs and hidden neurons jointly transform features, which are then utilized for Task 1 to Task $T$. (b) A modern-day RNN used for multiple-target language translation, which jointly transforms features from shared sequence-based representations. $(h_{1},\cdots,h_{L_{\boldsymbol{x}}})$ represents the sequence of bidirectional recurrent representations, where $L_{\boldsymbol{x}}$ is the number of tokens in the source sentence $\boldsymbol{x}$. $s_{i}^{(t)}$ is a recurrent neural network hidden state at time $i$ for the $t$-th task, which is estimated based on the combination of $(h_{1},\cdots,h_{L_{\boldsymbol{x}}})$ weighted by $A^{(t)}$.

2.1.2. Feature Transformation

Unlike the sparse learning methods discussed in §2.1.1, which assume direct use of observed features, feature transformation methods aim to combine and transform, rather than simply select, the raw features into new representations. This approach enables handling coarse-grained input data. Sparse learning in MTL builds task relatedness into the model $f(\cdot)$ by sharing a similar weight structure across multiple tasks, whereas feature learning in MTL relates tasks to each other by enforcing a common underlying representation (argyriou2006multi). For example, yu2019towards points out that the two tasks of aesthetic quality assessment and emotion recognition in digital image analysis share similar feature representations. Another example from caruna1993multitask; caruana1997multitask, as shown in Fig. 7(a), reveals that different tasks can simultaneously learn from the same feature encodings in feedforward neural networks (FNNs).
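The idea of several tasks reading from the same feature encoding can be sketched as a forward pass through a feedforward network with one shared hidden layer and one linear output per task. The layer sizes, the `tanh` activation, the initialization, and all names below are assumptions chosen for illustration; they are not taken from the cited works.

```python
import numpy as np

def init_hard_sharing(D, H, T, seed=0):
    """One shared hidden layer (D -> H) and T task-specific linear heads (H -> 1)."""
    rng = np.random.default_rng(seed)
    return {
        "W_shared": rng.normal(scale=0.1, size=(D, H)),
        "b_shared": np.zeros(H),
        "heads": rng.normal(scale=0.1, size=(H, T)),  # column t belongs to task t
    }

def forward(params, X):
    # Shared representation: computed once and reused by every task.
    hidden = np.tanh(X @ params["W_shared"] + params["b_shared"])
    # Task-specific outputs: each column is the prediction for one task.
    predictions = hidden @ params["heads"]
    return predictions, hidden

params = init_hard_sharing(D=4, H=8, T=3)
X = np.random.default_rng(1).normal(size=(5, 4))   # 5 samples, 4 raw features
preds, hidden = forward(params, X)
```

During training, gradients from all task losses would flow into `W_shared`, which is how the tasks influence each other, whereas each column of `heads` is updated only by its own task.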

Multi-Task Feature Learning (argyriou2006multi) linearly combines observations/features by introducing a transformation matrix $\boldsymbol{U}\in\boldsymbol{O}^{D}$, which can be extended to nonlinear combinations by using kernel methods. As formulated in the following,

$$\min\limits_{\boldsymbol{U},\boldsymbol{W}}\frac{1}{2}\sum\limits_{t=1}^{T}\frac{1}{N_{t}}\|(\boldsymbol{X}^{(t)}\boldsymbol{U})\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\Big(\sum\limits_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}\Big)^{2},\quad\text{s.t.}~\boldsymbol{U}\in\boldsymbol{O}^{D}, \tag{14}$$

we need to estimate $\boldsymbol{U}$ and $\boldsymbol{W}$ from the data. The $\ell_{2,1}$ norm imposed on $\boldsymbol{W}$ ensures that the transformed features, i.e., $\boldsymbol{X}^{(t)}\boldsymbol{U}$, with a fixed $\boldsymbol{U}$, are collectively selected across tasks. To learn the transformed features, argyriou2006multi fix $\boldsymbol{W}$ and minimize the objective function (14) over $\boldsymbol{U}$ under the orthogonality constraint. Even with this two-step alternating optimization over $\boldsymbol{W}$ and $\boldsymbol{U}$, problem (14) remains non-convex. Accordingly, it is transformed into an equivalent convex problem, also known as convex multi-task feature learning (argyriou2006multi; argyriou2008convex), which is mentioned in argyriou2006multi and further discussed in argyriou2008convex together with the learning of non-linear features using kernel methods:

$$\begin{aligned}\min_{\boldsymbol{V},\boldsymbol{W}}\quad&\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{t=1}^{T}{\boldsymbol{w}^{t}}^{\top}\boldsymbol{V}^{+}\boldsymbol{w}^{t},\\ \text{s.t.}\quad&\boldsymbol{V}\in\boldsymbol{S}_{+}^{D},~\text{tr}(\boldsymbol{V})\leq 1,~\text{col}(\boldsymbol{W})\subseteq\text{col}(\boldsymbol{V}).\end{aligned}\tag{15}$$
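As a rough numerical illustration of problems (14) and (15), the sketch below evaluates the objective of (14) for a given orthogonal $\boldsymbol{U}$ and the regularizer of (15) for a given $\boldsymbol{V}$. The function names, the synthetic data, and the specific choice $\boldsymbol{V}=(\boldsymbol{W}\boldsymbol{W}^{\top})^{1/2}/\text{tr}((\boldsymbol{W}\boldsymbol{W}^{\top})^{1/2})$ are assumptions made for illustration; with this choice the regularizer of (15) evaluates to the squared trace norm of $\boldsymbol{W}$, which hints at why the convex form still couples the tasks.

```python
import numpy as np

def mtfl_objective(Xs, ys, U, W, lam):
    """Objective of Eq. (14): squared loss on transformed features plus the
    squared l_{2,1} norm of W (rows index features, columns index tasks)."""
    loss = 0.0
    for t, (X, y) in enumerate(zip(Xs, ys)):
        resid = (X @ U) @ W[:, t] - y
        loss += resid @ resid / X.shape[0]
    row_norms = np.linalg.norm(W, axis=1)        # ||w_d||_2 for each feature d
    return 0.5 * loss + lam * row_norms.sum() ** 2

def convex_regularizer(W, V_pinv):
    """Regularizer of Eq. (15): sum_t w^t' V^+ w^t = tr(W' V^+ W)."""
    return float(np.trace(W.T @ V_pinv @ W))

# Synthetic, noise-free data with hypothetical sizes.
rng = np.random.default_rng(0)
D, T, N, lam = 5, 3, 20, 0.1
Xs = [rng.standard_normal((N, D)) for _ in range(T)]
W = rng.standard_normal((D, T))
ys = [X @ W[:, t] for t, X in enumerate(Xs)]

obj14 = mtfl_objective(Xs, ys, np.eye(D), W, lam)   # U = I is orthogonal

# Assumed choice of V, built from the thin SVD W = P diag(s) Q'.
P, s, _ = np.linalg.svd(W, full_matrices=False)
V = (P * s) @ P.T / s.sum()                 # (W W')^{1/2} / tr((W W')^{1/2})
V_pinv = s.sum() * (P / s) @ P.T            # pseudo-inverse of V on col(W)
reg15 = convex_regularizer(W, V_pinv)       # equals (sum of singular values)^2
```

Here `V` satisfies the constraints of (15) by construction: it is positive semidefinite, has unit trace, and its column space contains that of `W`.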

dong2015multi first extend neural machine translation to an MTL framework that shares a bidirectional recurrent representation with forward and backward sequence information, as shown in Fig. 6(b). Suppose we have $T$ different language pairs $\{(\mathbf{x}^{(t)},\boldsymbol{y}^{(t)})\}_{t=1}^{T}$, for instance, from English to other languages such as French, Spanish, Dutch, and Portuguese. The probability of generating each translated word at time step $i$ is

$$p(y_{i}^{(t)}|y_{1}^{(t)},\cdots,y_{i-1}^{(t)},\boldsymbol{x}^{(t)})=f(y_{i-1}^{(t)},s_{i}^{(t)},c_{i}^{(t)}),\quad t=1,\cdots,T, \tag{16}$$

where $f$ is parameterized by an FNN, $s_{i}^{(t)}$ is the hidden state of a recurrent neural network at time step $i$, and $c_{i}^{(t)}$ is a context vector calculated from a sequence of annotations $(h_{1},\cdots,h_{L_{\boldsymbol{x}}})$, which is mapped from the original sentence $\boldsymbol{x}$ by an encoder. For more details on bidirectional sequence learning, please refer to dong2015multi. After that, all annotations $h_{j}~(j=1,\cdots,L_{\boldsymbol{x}})$ are collectively transformed by soft alignment parameters $A^{(t)}~(t=1,\cdots,T)$ for each encoder-decoder to achieve cross-task communication.
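The role of the shared annotations and the task-specific soft alignment can be sketched as follows. The additive (tanh-based) scoring function, the parameter names `Wa`, `Ua`, `va`, and all dimensions are assumptions chosen for illustration; they are not taken from dong2015multi. The point is only that every task reads the same annotations $h_{j}$ but forms its own context vector through its own alignment parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def context_vector(s_prev, H, Wa, Ua, va):
    """Soft alignment: c = sum_j a_j h_j with a = softmax(e) and
    e_j = va' tanh(Wa s_prev + Ua h_j) (assumed additive scoring)."""
    scores = np.tanh(s_prev @ Wa.T + H @ Ua.T) @ va   # shape (L_x,)
    weights = softmax(scores)
    return weights @ H, weights

rng = np.random.default_rng(0)
L_x, d_h, d_s, d_a, T = 7, 8, 6, 5, 3        # hypothetical sizes

H = rng.standard_normal((L_x, d_h))          # annotations from the shared encoder
s_prev = rng.standard_normal(d_s)            # a decoder hidden state

# One set of alignment parameters per task (target language).
params = [
    (rng.standard_normal((d_a, d_s)),
     rng.standard_normal((d_a, d_h)),
     rng.standard_normal(d_a))
    for _ in range(T)
]
contexts = [context_vector(s_prev, H, *p) for p in params]
```

Each entry of `contexts` holds a context vector of length `d_h` and the alignment weights over the `L_x` source positions for one task.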

Remarks (i) Feature transformation enables multiple tasks to share the same underlying representations. (ii) The features from different tasks can interact with each other, providing mutual benefits across all tasks.

2.1.3. Low-Rank Factorization

In MTL, as discussed before, information sharing among multiple tasks can be achieved by assuming that all the tasks are impacted by the same small subset of predictors. On the other hand, low-rank structures imposed on the coefficient matrices or tensors can induce a different type of information sharing among tasks, i.e., the tasks are affected by the predictors through a shared small set of latent variables or directions, which are extracted from the original feature space and span the subspace most relevant to the outcomes. Depending on the way of indexing multiple learning tasks, one can choose to organize the coefficient vectors from multiple learning tasks into a matrix of dimension $D\times T$ or a tensor with a more delicate structure. In general, multi-dimensional task indices commonly imply that there are multi-layer relationships among the tasks, and the tensor form helps keep this inherent structure, which allows leveraging information from different dimensions of task similarities.

Matrix Factorization

The most commonly seen situation is when we organize the coefficient vectors from multiple tasks into a matrix $\boldsymbol{W}$, and the rank-penalized problem can be formulated as

$$\min_{\boldsymbol{W}}\sum_{t=1}^{T}\mathcal{L}^{(t)}\left(f(\boldsymbol{X}^{(t)},\boldsymbol{w}^{t}),\boldsymbol{y}^{(t)}\right)+\lambda~\text{rank}(\boldsymbol{W}). \tag{17}$$

However, minimizing the rank of a matrix is NP-hard (vandenberghe1996semidefinite) due to the combinatorial nature of the rank function (ji2009accelerated; han2016multi). An alternative is to substitute the rank penalty with the trace of the matrix when it is symmetric positive semidefinite (mesbahi1999semi), but this excludes the non-symmetric or even non-square matrices that arise in real-world applications. fazel2001rank generalized the trace heuristic to any matrix by introducing the trace norm (a.k.a. nuclear norm or Ky Fan norm) (horn2012matrix), which is defined as the sum of all singular values of a matrix (see Table 3).
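A small numerical check may help connect the rank, the trace, and the trace norm. The example matrix below is hypothetical; the sketch only relies on the definition given above (sum of singular values) and on the fact that, for a symmetric positive semidefinite matrix, the trace norm coincides with the trace.

```python
import numpy as np

def trace_norm(W):
    """Sum of the singular values of W (defined for non-square matrices too)."""
    return float(np.linalg.svd(W, compute_uv=False).sum())

# A 3 x 2 rank-1 matrix: its only non-zero singular value is ||u|| * ||v|| = 3 * 5.
W = np.outer([1.0, 2.0, 2.0], [3.0, 4.0])

rank_W = int(np.linalg.matrix_rank(W))       # combinatorial quantity
nuc_W = trace_norm(W)                        # convex surrogate
nuc_W_builtin = float(np.linalg.norm(W, "nuc"))

# For the symmetric PSD matrix A = W'W, the trace equals the trace norm.
A = W.T @ W
trace_A, nuc_A = float(np.trace(A)), trace_norm(A)
```

The non-square `W` has no trace, but its trace norm is well defined, which is the generalization the trace heuristic needed.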

Low Rank Multi-Task Learning (ji2009accelerated) first introduces the trace norm optimization problem into MTL, which yields a low-rank solution that maps to a low-dimensional feature subspace. The problem can be written as

$$\min_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\|\boldsymbol{W}\|_{*}, \tag{18}$$

where $\|\cdot\|_{*}$ denotes the trace norm of the weight matrix $\boldsymbol{W}$. The technical challenge for the problem above is the non-smooth nature of the trace norm, which makes subgradient-based solvers converge slowly, at a rate of $O(\frac{1}{\sqrt{k}})$, where $k$ is the number of iterations. ji2009accelerated developed an accelerated gradient method that improves the convergence rate of trace norm minimization from $O(\frac{1}{\sqrt{k}})$ to $O(\frac{1}{k})$, and even to $O(\frac{1}{k^{2}})$ with the help of Nesterov's method (nesterov1983method). A dual reformulation (pong2010trace) of problem (18) can also make it easier to solve. In fact, both the rank penalty and the trace norm can be written in a more general form $\sum_{r=1}^{\min(D,T)}\rho(\sigma_{r}(\boldsymbol{W}))$, where $\rho(\cdot)$ is a penalty function and $\sigma_{r}(\boldsymbol{W})$ is the $r$-th largest singular value of $\boldsymbol{W}$.
When $\rho(\sigma_{r}(\boldsymbol{W}))=I(\sigma_{r}(\boldsymbol{W})\neq 0)$, where $I(\cdot)$ is the indicator function, we get the rank penalty, which is also the $\ell_{0}$ norm of the singular values. When $\rho(\sigma_{r}(\boldsymbol{W}))=\sigma_{r}(\boldsymbol{W})$, we get the nuclear norm penalty, i.e., the $\ell_{1}$ norm of the singular values. For $0\leq h\leq 1$, the properties of the $\ell_{h}$ norm of the singular values, i.e., the Schatten-$h$ quasi-norm penalty, have been investigated in rohde2011estimation.
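To make the optimization of problem (18) more concrete, the following sketch runs a plain proximal gradient iteration in which the non-smooth trace norm is handled by soft-thresholding the singular values. This is a simplified, non-accelerated variant written for illustration, not the accelerated scheme of ji2009accelerated; the function names, the synthetic data, and the step size rule (the inverse of the largest eigenvalue of $\boldsymbol{X}^{(t)\top}\boldsymbol{X}^{(t)}/N_{t}$ over tasks) are assumptions.

```python
import numpy as np

def svt(W, thresh):
    """Singular value soft-thresholding: proximal operator of thresh * ||.||_*."""
    P, s, Qt = np.linalg.svd(W, full_matrices=False)
    return (P * np.maximum(s - thresh, 0.0)) @ Qt

def grad_loss(Xs, ys, W):
    """Gradient of the smooth part of Eq. (18) with respect to W."""
    G = np.zeros_like(W)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        G[:, t] = X.T @ (X @ W[:, t] - y) / X.shape[0]
    return G

def objective(Xs, ys, W, lam):
    """Value of Eq. (18): averaged squared loss plus lam * trace norm."""
    loss = sum(np.sum((X @ W[:, t] - y) ** 2) / X.shape[0]
               for t, (X, y) in enumerate(zip(Xs, ys)))
    return 0.5 * loss + lam * np.linalg.svd(W, compute_uv=False).sum()

def prox_grad(Xs, ys, lam, step, n_iter=200):
    """Gradient step on the loss, then shrinkage of the singular values."""
    W = np.zeros((Xs[0].shape[1], len(Xs)))
    history = []
    for _ in range(n_iter):
        W = svt(W - step * grad_loss(Xs, ys, W), step * lam)
        history.append(objective(Xs, ys, W, lam))
    return W, history

# Synthetic tasks sharing a rank-1 weight matrix (hypothetical setup).
rng = np.random.default_rng(0)
D, T, N, lam = 6, 4, 50, 0.1
W_true = np.outer(rng.standard_normal(D), rng.standard_normal(T))
Xs = [rng.standard_normal((N, D)) for _ in range(T)]
ys = [X @ W_true[:, t] + 0.01 * rng.standard_normal(N) for t, X in enumerate(Xs)]

lipschitz = max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in Xs)
W_hat, history = prox_grad(Xs, ys, lam, step=1.0 / lipschitz)
obj_at_zero = objective(Xs, ys, np.zeros((D, T)), lam)
```

With this step size the objective values in `history` do not increase from one iteration to the next; an accelerated variant would add a momentum-style extrapolation between iterates.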

Instead of using different power functions of singular values as penalty functions, there are some other variants of the nuclear norm penalty that can lead to more delicate learning of a low-rank matrix.

The rank of a matrix equals the number of its non-zero singular values, so a lower rank corresponds to fewer non-zero singular values. The trace norm, however, penalizes all singular values, including the large ones. It is more desirable and reasonable to shrink only the small singular values toward zero, which yields a more focused and effective regularization. To leave the larger singular values un-penalized, Reduced Rank Multi-Stage Multi-Task Learning (RAMUSA) (han2016multi) considers the objective function with the truncated trace norm (zhang2012matrix) as

$$\min_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{r=1}^{\min(D,T)}\min\{\sigma_{r}(\boldsymbol{W}),\tau\}. \tag{19}$$

The parameter $\tau$ serves as a threshold on the singular value magnitude, and only those singular values smaller than $\tau$ get penalized. When $\tau\rightarrow\infty$, problem (19) reduces to the low-rank MTL problem (18). To address this non-convex problem, han2016multi introduce a multi-stage algorithm designed to learn a surrogate upper-bound function. Theoretical proofs affirm its capability for shrinkage, making it an effective approach to tackle the non-convex optimization challenge.
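The effect of the threshold $\tau$ in the penalty of problem (19) can be checked on a toy matrix. The matrix and the value of $\tau$ below are hypothetical, and the sketch only evaluates the penalty term; it does not implement the multi-stage algorithm of han2016multi.

```python
import numpy as np

def capped_trace_norm(W, tau):
    """Penalty term of Eq. (19): sum_r min(sigma_r(W), tau)."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.minimum(s, tau).sum())

W = np.diag([5.0, 1.0, 0.2])                 # singular values 5, 1, 0.2

# With tau = 1.5 the largest singular value contributes the constant 1.5,
# so increasing it further does not increase the penalty.
capped = capped_trace_norm(W, tau=1.5)       # 1.5 + 1.0 + 0.2
uncapped = capped_trace_norm(W, tau=np.inf)  # plain trace norm: 5 + 1 + 0.2
```

Because the contribution of a singular value above $\tau$ is constant, only the singular values below $\tau$ feel any pull toward zero.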

An alternative to the truncated trace norm for relieving the shrinkage on large singular values is the adaptive nuclear norm penalization $\lambda\sum_{r=1}^{R}\alpha_{r}\sigma_{r}(\boldsymbol{W})$ with $R=\min(D,T)$, proposed by chen2013reduced. The weights $\{\alpha_{r}\}_{r=1}^{R}$ adjust the level of penalization on each singular value; they should be non-negative and satisfy $\alpha_{1}\leq\ldots\leq\alpha_{R}$. The explanation is straightforward: the larger weights on the smaller singular values ensure a greater shrinkage towards 0, while the smaller weights on the larger singular values reduce the shrinkage magnitude.
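A minimal sketch of the adaptive nuclear norm is given below. The weight values and the helper `weighted_soft_threshold`, which subtracts $\lambda\alpha_{r}$ from the $r$-th singular value, are illustrative assumptions meant to show how non-decreasing weights translate into less shrinkage for larger singular values; how the weights are chosen in chen2013reduced is not reproduced here.

```python
import numpy as np

def adaptive_nuclear_norm(W, alpha):
    """sum_r alpha_r * sigma_r(W), with non-negative, non-decreasing alpha."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values, largest first
    alpha = np.asarray(alpha, dtype=float)
    if alpha.shape != s.shape:
        raise ValueError("need one weight per singular value")
    if np.any(alpha < 0) or np.any(np.diff(alpha) < 0):
        raise ValueError("weights must be non-negative and non-decreasing")
    return float(alpha @ s)

def weighted_soft_threshold(s, lam, alpha):
    """Shrink each singular value by its own amount lam * alpha_r (assumed rule)."""
    return np.maximum(np.asarray(s) - lam * np.asarray(alpha), 0.0)

W = np.diag([5.0, 1.0, 0.2])
alpha = [0.1, 1.0, 5.0]                      # hypothetical weights
lam = 0.2

penalty = adaptive_nuclear_norm(W, alpha)    # 0.1*5 + 1*1 + 5*0.2
shrunk = weighted_soft_threshold([5.0, 1.0, 0.2], lam, alpha)
```

In this toy case the largest singular value is reduced by only 0.02, while the smallest one is set to zero.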

Low-rank methods are useful for achieving dimension reduction by learning a small set of latent variables. However, low-rank methods alone cannot identify which variables are truly predictive of the outcomes. To obtain a more interpretable model, one can assume that not all predictors affect the outcomes and add a sparsity-inducing penalty on top of the low-rank restriction. In the field of statistics, this line of research has received a lot of attention, and variable selection can be achieved by adding a row-wise penalization on the coefficient matrix in a rank-restricted model. For example, chen2012sparse apply a group-lasso type penalty on the rows of the coefficient matrix. Similar works include bunea2012joint and she2017selective. Another form of sparsity structure considered in low-rank models is the sparse SVD discussed in chen2012reduced and uematsu2019sofar. Sparse SVD achieves predictor and response selection simultaneously. With a rank $r$, SVD dissects the correlation between responses and predictors, i.e., the coefficient matrix, into $r$ orthogonal channels. The importance of each channel is measured by a singular value, and within each channel, the weights on predictors (responses) are in the corresponding left (right) singular vectors of the $D\times T$ coefficient matrix. The sparse SVD can achieve both an SVD layer-specific sparsity pattern, by imposing sparsity on the elements of each singular vector to find the different subsets of predictors/responses that take effect in each correlation pathway (chen2012reduced), and global variable selection, by shrinking all weights related to a certain variable in the singular vectors to zero (uematsu2019sofar).
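The row-wise penalization mentioned above can be sketched as follows. The code evaluates a group-lasso type penalty on the rows of a coefficient matrix and applies the corresponding row-wise soft-thresholding, which removes a predictor from all tasks at once. The matrix, the threshold, and the function names are hypothetical, and the rank restriction used in the cited methods is not included in this sketch.

```python
import numpy as np

def row_group_penalty(W):
    """Group-lasso type penalty: sum over predictors d of ||w_d||_2."""
    return float(np.linalg.norm(W, axis=1).sum())

def row_soft_threshold(W, thresh):
    """Shrink each row toward zero by thresh in Euclidean norm; rows whose
    norm is below thresh become exactly zero (predictor dropped for all tasks)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - thresh / np.maximum(norms, 1e-12), 0.0)
    return W * scale

# Rows are predictors, columns are tasks (hypothetical values).
W = np.array([[3.0, 4.0],     # strong predictor, row norm 5
              [0.3, 0.4],     # weak predictor, row norm 0.5
              [0.0, 0.0]])    # irrelevant predictor

penalty = row_group_penalty(W)
shrunk = row_soft_threshold(W, thresh=1.0)
selected = np.flatnonzero(np.linalg.norm(shrunk, axis=1) > 0)
```

Only the first predictor survives the thresholding here, which is the kind of global variable selection a row-wise penalty is meant to provide.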

Tensor Factorization

When we have multiple learning tasks that can be indexed by multi-dimensional indices, instead of stacking all the weight vectors into a matrix of dimension features $\times$ tasks, we can keep the index structure of the tasks by storing the weight vectors in a tensor, which leads to MultiLinear Multi-Task learning (MLMT) (wimalawarne2014multitask). MLMT offers several advantages over conventional MTL. First, it allows us to keep the inherent structure of the learning tasks so that different dimensions of task similarities can be learned, and the higher-order structures among tasks can be recovered as well. Moreover, task imputation (i.e., TL) is made available with MLMT for tasks with no training data (wimalawarne2014multitask). The learning problem can be written as

$$\min_{\boldsymbol{\mathcal{W}}}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2} \tag{20}$$

where $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times I_{2}\times\cdots\times I_{N}}$ is a tensor consisting of the learning weights $\boldsymbol{w}^{t}\in\mathbb{R}^{D}$, and the total number of tasks is $T=\prod_{j=2}^{N}I_{j}$.
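The relation between the stacked weight matrix and the weight tensor can be illustrated with a simple reshape. The sizes and the row-major mapping from a multi-dimensional task index to a column index are assumptions made for this sketch.

```python
import numpy as np

D, I2, I3 = 4, 2, 3          # hypothetical: 4 features, tasks indexed by (i2, i3)
T = I2 * I3                  # total number of tasks, T = prod_j I_j

# Conventional MTL: one column of weights per task.
W_matrix = np.arange(D * T, dtype=float).reshape(D, T)

# MLMT: keep the two task indices as separate tensor axes.
W_tensor = W_matrix.reshape(D, I2, I3)

# Task (i2, i3) corresponds to column t = i2 * I3 + i3 under row-major order.
i2, i3 = 1, 2
t = int(np.ravel_multi_index((i2, i3), (I2, I3)))
w_task = W_tensor[:, i2, i3]
```

The tensor holds exactly the same numbers as the matrix; what changes is that similarity along each task axis can now be regularized separately.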

To exploit task similarities at each dimension, similar to low-rank matrix-based MTL, a multilinear rank restriction can be imposed on the weight tensor. In romera2013multilinear, the authors directly incorporated the rank restriction into the learning task by using a low-rank Tucker decomposition (kolda2009tensor) of the weight tensor, and the Frobenius norms of Tucker decomposition components are added as regularizations to reduce overfitting. This optimization problem is solved by alternating minimization.

Alternatively, tensor trace norms are commonly used as convex approximations to rank restrictions. However, unlike the matrix rank, the tensor rank has no unique definition, so various trace norms have been developed to meet different analysis demands for different anticipated information-sharing mechanisms among tasks (zhang2022learning). With $R(\boldsymbol{\mathcal{W}})$ denoting a tensor trace norm, the learning task is

$$\min_{\boldsymbol{\mathcal{W}}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda R(\boldsymbol{\mathcal{W}}) \tag{21}$$

where $\lambda$ is the tuning parameter that controls the magnitude of penalization.

In general, in the sense of Tucker decomposition or multi-linear SVD (tomioka2013convex; kolda2009tensor), tensor trace norms include two categories: the overlapped tensor trace norms and the latent tensor trace norms. The latent trace norm (tomioka2013convex; wimalawarne2014multitask) can be written as

$$\|\boldsymbol{\mathcal{W}}\|_{*,\text{latent}}=\inf_{\boldsymbol{\mathcal{W}}^{(1)}+\cdots+\boldsymbol{\mathcal{W}}^{(N)}=\boldsymbol{\mathcal{W}}}\sum_{k=1}^{N}\|\boldsymbol{W}_{(k)}^{(k)}\|_{*} \tag{22}$$

where $\boldsymbol{\mathcal{W}}^{(k)}$ are latent tensors of $\boldsymbol{\mathcal{W}}$ and $\boldsymbol{W}_{(k)}^{(k)}$ denotes the tensor $\boldsymbol{\mathcal{W}}^{(k)}$ flattened along its $k$th axis. Thus, the latent trace norm is the infimum of the sum of the matrix trace norms of the flattened latent tensors of $\boldsymbol{\mathcal{W}}$. To account for heterogeneous multilinear ranks and dimensions, wimalawarne2014multitask propose a scaled latent trace norm by adding a weight $I_{k}^{-1/2}$ to each component $\|\boldsymbol{W}_{(k)}^{(k)}\|_{*}$. It can identify the dimension with the lowest rank $r_{k}$ relative to its dimensionality $I_{k}$. The overlapped tensor trace norm (romera2013multilinear) of a tensor is defined as the weighted sum of the nuclear norms of its flattenings.
With different ways of tensor flattening, the overlapped tensor trace norms take different forms, including the Tucker trace norm (romera2013multilinear), which is a convex combination of the matrix trace norms of the flattenings along each axis of the tensor, and the Tensor-Train trace norm (oseledets2011tensor), which flattens the tensor along successive axes starting from the first axis. Given that the feature representation can be factorized into semantic basis vectors and linear coefficients mapping the basis vector space to the original feature vector space, yang2016deep introduce the use of low-rank tensors in MTL through deep representation learning.
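The two flattening schemes behind the overlapped trace norms can be sketched as follows. The uniform combination weights, the function names, and the rank-1 test tensor are assumptions made for illustration; the column ordering of the flattened matrices may differ from the convention in the cited works, which does not affect the singular values.

```python
import numpy as np

def unfold(Wt, k):
    """Mode-k flattening: axis k indexes the rows, all other axes are merged
    into the columns."""
    return np.moveaxis(Wt, k, 0).reshape(Wt.shape[k], -1)

def tt_unfold(Wt, k):
    """Tensor-train style flattening: the first k axes index the rows."""
    rows = int(np.prod(Wt.shape[:k]))
    return Wt.reshape(rows, -1)

def tucker_trace_norm(Wt, weights=None):
    """Convex combination of the trace norms of all mode-k flattenings."""
    n = Wt.ndim
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    return float(sum(a * np.linalg.norm(unfold(Wt, k), "nuc")
                     for k, a in enumerate(w)))

def tt_trace_norm(Wt, weights=None):
    """Convex combination of the trace norms of the successive flattenings."""
    n = Wt.ndim - 1
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    return float(sum(a * np.linalg.norm(tt_unfold(Wt, k), "nuc")
                     for k, a in zip(range(1, Wt.ndim), w)))

# Rank-1 tensor a (x) b (x) c: every flattening has the single non-zero
# singular value ||a|| * ||b|| * ||c|| = 3 * 5 * 1.
a, b, c = [1.0, 2.0, 2.0], [3.0, 4.0], [1.0, 0.0, 0.0, 0.0]
Wt = np.einsum("i,j,k->ijk", a, b, c)

tucker_value = tucker_trace_norm(Wt)
tt_value = tt_trace_norm(Wt)
```

For a general tensor the flattenings have different spectra, so the two norms encode different assumptions about which axes share low-rank structure.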

Most overlapped tensor trace norms make use of only a subset of all possible flattenings of a tensor, each reflecting a different belief about the information-sharing mechanism among tasks. To search for all the low-rank structures in a weight tensor and unify the various overlapped tensor trace norms, zhang2022learning propose a Generalized Tensor Trace Norm (GTTN), which is a convex combination of the matrix trace norms of all possible tensor flattenings. The combination weights of the matrix trace norms are treated as unknown variables in the optimization problem to accommodate the different levels of importance of each flattening.
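The idea of combining all possible flattenings can be sketched by enumerating axis subsets. In the sketch below the combination weights are fixed to be uniform, which is a simplifying assumption (in GTTN they are unknowns of the optimization problem), and a subset and its complement are counted once because they yield transposed matrices with identical singular values. All names are hypothetical.

```python
import numpy as np
from itertools import combinations

def all_flattenings(ndim):
    """Non-empty proper subsets of axes, keeping one of each {s, complement} pair."""
    axes = range(ndim)
    seen, result = set(), []
    for r in range(1, ndim):
        for s in combinations(axes, r):
            comp = tuple(a for a in axes if a not in s)
            if comp in seen:
                continue                     # transpose of a flattening already kept
            seen.add(s)
            result.append(s)
    return result

def flatten(Wt, s):
    """Matrix whose rows are indexed by the axes in s, columns by the rest."""
    rest = [a for a in range(Wt.ndim) if a not in s]
    rows = int(np.prod([Wt.shape[a] for a in s]))
    return np.transpose(Wt, list(s) + rest).reshape(rows, -1)

def gttn_uniform(Wt):
    """Uniformly weighted sum of the trace norms of all flattenings."""
    subsets = all_flattenings(Wt.ndim)
    alpha = 1.0 / len(subsets)
    return float(sum(alpha * np.linalg.norm(flatten(Wt, s), "nuc")
                     for s in subsets))

# Rank-1 four-way tensor: every flattening has trace norm 3 * 5 * 1 * 2 = 30.
Wt = np.einsum("i,j,k,l->ijkl",
               [1.0, 2.0, 2.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0, 0.0])
n_flattenings = len(all_flattenings(Wt.ndim))   # 2**(4-1) - 1
value = gttn_uniform(Wt)
```

The number of distinct flattenings grows exponentially with the tensor order, which is one reason the weights are learned instead of being enumerated by hand.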

When nonlinear low-rank structures among tasks are expected to achieve better learning performance, zhang2022learning propose the nonlinear GTTN that firstly transforms the rows or columns of each flattened tensor nonlinearly via a neural network and then performs GTTN on the transformed parameters to capture the nonlinear low-rank structure among all the tasks. For models that are nonlinear in the data, signoretto2013learning also provide a kernel-based method for MLMT.

Remarks (i) Low-rank structures can achieve both information sharing among tasks and dimension reduction by enforcing that all tasks are affected by the same small set of latent variables extracted from the original feature space. (ii) Sparsity-inducing penalties can be added in addition to the rank restriction to achieve variable selection. (iii) Keeping the multi-dimensional indices of multiple tasks by storing the weight vectors in a tensor allows us to keep the inherent structure of the learning tasks so that: a. different dimensions of task similarities can be learned; b. the higher-order structures among tasks can be recovered; c. task imputation is made available for tasks with no training data.

2.1.4. Decomposition

Task-relatedness can be learned based on the assumption that similar tasks share the same non-zero elements, and these tasks can acquire richer representations through transformation or low-rank regularization. The decomposition methods discussed in this section aim to capture multiple aspects of task-relatedness, such as sparsity and low-rankness, by decomposing model weights into a sum or product of distinct components. These components not only capture shared information but also task-specific information that benefits each task. The flexibility of decomposition techniques provides deeper insights into the nature of multitasking, enabling exploration of various combinations of regularizers suitable for different types of multitasking, including the incorporation of irrelevant or outlier tasks. However, decomposition methods have a limitation. The regularization applied to complex components may lead to non-smooth optimization problems involving a large number of variables, which can pose challenges in efficiently solving the devised decomposition problem. In the MTL setting, the general formalization of decomposition problems can be expressed as

$$\begin{aligned}\min_{\boldsymbol{W}}\quad&\sum_{t=1}^{T}\mathcal{L}^{(t)}\left(f(\boldsymbol{X}^{(t)},\boldsymbol{w}^{t}),\boldsymbol{y}^{(t)}\right)+\lambda_{1}~\text{reg}_{1}(\boldsymbol{P})+\lambda_{2}~\text{reg}_{2}(\boldsymbol{Q}),\\ \text{s.t.}\quad&\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q}~\text{or}~\boldsymbol{W}=\boldsymbol{P}\cdot\boldsymbol{Q},\end{aligned}\tag{23}$$

where $\text{reg}_{1}$ and $\text{reg}_{2}$ are regularizers for the learning of different task-relatedness.
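As a rough illustration of the additive form of Eq. (23), the sketch below evaluates the objective for given components $\boldsymbol{P}$ and $\boldsymbol{Q}$. It makes several assumptions that are not part of the general formulation: every task uses a linear model with a squared loss, matrices are stored as plain Python lists, and the two regularizers are passed in by the caller. The names `decomposed_objective`, `task_loss`, `column`, and `l1` are hypothetical and are not taken from any of the cited works.

```python
from typing import Callable, List

Matrix = List[List[float]]  # D x T: rows index features, columns index tasks
Vector = List[float]


def column(M: Matrix, t: int) -> Vector:
    """Return the t-th column of M, i.e. the weight vector of task t."""
    return [row[t] for row in M]


def task_loss(X: Matrix, w: Vector, y: Vector) -> float:
    """Assumed loss: (1 / (2 N)) * ||X w - y||^2 for a single task."""
    residuals = [sum(x_d * w_d for x_d, w_d in zip(x, w)) - y_i
                 for x, y_i in zip(X, y)]
    return sum(r * r for r in residuals) / (2 * len(y))


def decomposed_objective(Xs: List[Matrix], ys: List[Vector],
                         P: Matrix, Q: Matrix,
                         reg1: Callable[[Matrix], float],
                         reg2: Callable[[Matrix], float],
                         lam1: float, lam2: float) -> float:
    """Objective of Eq. (23) with W = P + Q and caller-supplied regularizers."""
    W = [[p + q for p, q in zip(p_row, q_row)] for p_row, q_row in zip(P, Q)]
    loss = sum(task_loss(X, column(W, t), y)
               for t, (X, y) in enumerate(zip(Xs, ys)))
    return loss + lam1 * reg1(P) + lam2 * reg2(Q)


def l1(M: Matrix) -> float:
    """Element-wise l1 norm, used here only as a placeholder regularizer."""
    return sum(abs(v) for row in M for v in row)


# Toy problem: 2 features, 2 tasks, identity design matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
Xs = [identity, identity]
ys = [[1.0, 0.0], [1.0, 2.0]]
P = [[1.0, 1.0], [0.0, 0.0]]   # component shared by both tasks
Q = [[0.0, 0.0], [0.0, 1.0]]   # component specific to task 1
value = decomposed_objective(Xs, ys, P, Q, l1, l1, lam1=0.1, lam2=0.5)
```

Swapping `reg1` and `reg2` for other norms changes which kind of task-relatedness is rewarded, while the loss term stays the same.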

Form “$\boldsymbol{P}+\boldsymbol{Q}$”

The Dirty Block-Sparse Model (jalali2010dirty) is motivated by the observation that block-sparsity regularizers ($\ell_{p,1}$) are influenced by the degree of feature overlap among tasks. Acknowledging the prevalence of dirty high-dimensional data in many multi-task scenarios (that is, data that are not only high-dimensional, containing a large number of features or attributes, but also contain errors, inaccuracies, or misleading information), this model explicitly decomposes the weight matrix into element-wise sparse and block-sparse components:

$$
\min_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}(\boldsymbol{s}^{t}+\boldsymbol{b}^{t})-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{s}_{d}\|_{1}+\lambda_{2}\sum_{d=1}^{D}\|\boldsymbol{b}_{d}\|_{\infty},\quad s.t.~\boldsymbol{W}=\boldsymbol{S}+\boldsymbol{B},
\tag{24}
$$

where $\boldsymbol{s}^{t}$ and $\boldsymbol{b}^{t}$ are the $t$-th columns of $\boldsymbol{S}$ and $\boldsymbol{B}$, respectively. The $\ell_{1,1}$ norm learns an uneven sparse structure (obozinski2006multi; zhang2006a), while the $\ell_{\infty,1}$ norm guarantees that features admitting block-wise sparsity are learned collectively across tasks (zhang2006a). jalali2010dirty proves that Eq. (24) can match Lasso ($\ell_{1}$) for no-sharing STL and $\ell_{\infty,1}$ for fully-sharing MTL, and that it strictly outperforms both methods elsewhere, including the dirty setting.
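To see why splitting the weights can pay off, the following sketch compares the two penalty terms of Eq. (24) on a toy weight matrix. It only evaluates penalties and does not solve the optimization problem. The matrix, the values of $\lambda_1$ and $\lambda_2$, and the function names are illustrative assumptions, not taken from jalali2010dirty.

```python
from typing import List

Matrix = List[List[float]]  # D x T: rows index features, columns index tasks


def elementwise_penalty(S: Matrix) -> float:
    """sum_d ||s_d||_1: every non-zero entry is charged individually."""
    return sum(abs(v) for row in S for v in row)


def block_penalty(B: Matrix) -> float:
    """sum_d ||b_d||_inf: each row is charged once, by its largest entry."""
    return sum(max(abs(v) for v in row) for row in B)


def dirty_penalty(S: Matrix, B: Matrix, lam1: float, lam2: float) -> float:
    """Regularization part of Eq. (24) for a given split W = S + B."""
    return lam1 * elementwise_penalty(S) + lam2 * block_penalty(B)


# Toy weights (3 features, 4 tasks): feature 0 is used by every task,
# feature 1 only by task 2, feature 2 by no task.
shared = [[1.0, 1.0, 1.0, 1.0], [0.0] * 4, [0.0] * 4]
private = [[0.0] * 4, [0.0, 0.0, 1.0, 0.0], [0.0] * 4]
zeros = [[0.0] * 4 for _ in range(3)]
W = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(shared, private)]

lam1, lam2 = 1.0, 2.0  # illustrative values only
cost_split = dirty_penalty(private, shared, lam1, lam2)  # shared row in B
cost_all_sparse = dirty_penalty(W, zeros, lam1, lam2)    # everything in S
cost_all_block = dirty_penalty(zeros, W, lam1, lam2)     # everything in B
```

With these assumed values the split assignment is the cheapest of the three: the fully shared row is charged once through the $\ell_{\infty}$ term, while the isolated entry is charged at the lower element-wise rate.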

Robust Multi-Task Feature Learning (rMTFL) (gong2012robust) captures the task-shared features among relevant tasks and identifies outlier tasks simultaneously. Specifically, the weight matrix for all tasks is first decomposed into two components; gong2012robust then impose the well-known $\ell_{2,1}$ penalty on the first component and the $\ell_{1,2}$ penalty on the second component. Formally, rMTFL can be formulated as

$$
\min_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{p}_{d}\|_{2}+\lambda_{2}\sqrt{\sum_{d=1}^{D}\|\boldsymbol{q}_{d}\|_{1}^{2}},\quad s.t.~\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q},
\tag{25}
$$

where the penalty applied to the rows of $\boldsymbol{P}$ captures shared information, as it selects the same non-zero features across all tasks. Simultaneously, the penalty on the columns of $\boldsymbol{Q}$ induces task-wise sparsity, which is used to identify outlier tasks. gong2012robust establish a theoretical bound that measures how well rMTFL approximates the true evaluation, and also provide error bounds between the estimated weights of rMTFL and the underlying true weights. Note that this method specifically targets MTL settings in which some of the tasks are outliers.
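The row-selection effect of the penalty $\lambda_{1}\sum_{d}\|\boldsymbol{p}_{d}\|_{2}$ in Eq. (25) can be seen from its proximal operator, which shrinks each row as a whole and removes rows whose norm falls below the threshold. The sketch below implements this standard row-wise shrinkage on plain Python lists. It is only an illustration of the mechanism, not the solver used in gong2012robust, and the function name `prox_row_l2` and the toy matrix are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]  # D x T: rows index features, columns index tasks


def prox_row_l2(P: Matrix, threshold: float) -> Matrix:
    """Proximal operator of threshold * sum_d ||p_d||_2.

    Each row is scaled by max(0, 1 - threshold / ||p_d||_2), so a row is
    either kept (shrunk) for all tasks or set to zero for all tasks.
    """
    shrunk = []
    for row in P:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0.0 else 0.0
        shrunk.append([scale * v for v in row])
    return shrunk


# Toy matrix: a strong shared feature, a weak feature, and an unused one.
P = [[3.0, 4.0], [0.3, -0.4], [0.0, 0.0]]
shrunk = prox_row_l2(P, threshold=1.0)
# The first row (norm 5) is scaled by 0.8; the weak second row is removed
# for both tasks at once, which is the shared-sparsity pattern.
```

Applying the same operator to the transposed matrix shrinks whole task columns instead of feature rows.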

Robust Multi-Task Learning (RMTL) (chen2011integrating) targets real-world applications in which certain tasks are irrelevant to the other groups of tasks, which harms the learning performance of the remaining tasks. RMTL captures task relatedness by learning a low-rank structure while simultaneously identifying outlier tasks. This approach draws inspiration from previous research on group sparsity (obozinski2006multi; lee2010adaptive). It is formulated as the non-smooth convex optimization problem

$$
\min_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}(\boldsymbol{p}^{t}+\boldsymbol{q}^{t})-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\|\boldsymbol{P}\|_{*}+\lambda_{2}\sum_{t=1}^{T}\|\boldsymbol{q}^{t}\|_{2},\quad s.t.~\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q}.
\tag{26}
$$

Different from feature selection techniques, the $\ell_{2,1}$ norm here is imposed on the columns of the weight matrix, so that group sparsity is learned over tasks across all features. This induces task-wise sparsity in $\boldsymbol{Q}$, which identifies outlier tasks and thereby diminishes their negative influence, while the low-rank structure of $\boldsymbol{P}$ captures the relatedness among tasks. This differs from hsu2010robust, which focuses on learning both low-rank and sparse structures and provides a theoretically established and unique decomposition. RMTL, on the other hand, simultaneously learns both the low-rank and task-wise sparse structures through an accelerated proximal method (APM) (nemirovski1994efficient; nesterov1998introductory). A performance bound for this integrated approach is also proven.

Sparse and Low-Rank Multi-Task Learning (chen2012learning) also decomposes the weight matrix into a low-rank component and a sparse component. Unlike chen2011integrating, which jointly optimizes both structures in the objective function, chen2012learning uses a trace norm constraint to implicitly encourage the low-rank structure. The formulation is

$$
\min_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{p}_{d}\|_{1},\quad s.t.~\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q},~\|\boldsymbol{Q}\|_{*}\leq\tau.
\tag{27}
$$

Problem (27) is proved to be the tightest convex surrogate of the non-convex, NP-hard problem with a cardinality regularization term ($\ell_{0}$ norm) and a low-rank constraint. A general projected gradient scheme (boyd2004convex) is applied to solve this relaxed convex problem, which can also be accelerated using Nesterov’s method (nesterov1998introductory).

Form “$\boldsymbol{P}\cdot\boldsymbol{Q}$”

Alternating Structure Optimization (ASO) (ando2005framework) aims to facilitate structural learning from multiple tasks. By introducing an auxiliary variable $\boldsymbol{u}^{(t)}$ for each task $t$ such that $\boldsymbol{u}^{(t)}=\boldsymbol{w}^{(t)}+\Theta^{\top}\boldsymbol{v}^{(t)}$, the problem is formulated as

$$
\begin{aligned}
\min_{\{\boldsymbol{W},\boldsymbol{V}\},\Theta}\quad & \frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{u}^{(t)}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}, \\
\text{s.t.}\quad & \Theta\Theta^{\top}=\boldsymbol{I}.
\end{aligned}
\tag{28}
$$

The solution process for problem (28) alternates between two steps: first fixing $(\Theta,\boldsymbol{v})$ and optimizing $\boldsymbol{u}$, and then fixing $\boldsymbol{u}$ and optimizing $(\Theta,\boldsymbol{v})$. The first step is a convex problem that is easily addressed by classic optimization methods such as stochastic gradient descent (SGD). The second step can be tackled using singular value decomposition (SVD) along with a series of linear algebra transformations. However, the non-convex ASO algorithm is not guaranteed to converge to a global optimum and may get stuck in local optima.

Convex ASO (cASO) (chen2009convex) investigates the use of convex relaxations to improve the convergence properties of the algorithm and can converge to a global optimum. Firstly, an improved ASO (iASO) formulation is proposed as an initial non-convex problem

$$
\begin{aligned}
& \sum_{t=1}^{T}\frac{1}{N_{t}}\max\{\boldsymbol{0},1-(\boldsymbol{X}^{(t)}\boldsymbol{u}^{(t)})\cdot\boldsymbol{y}^{(t)}\}+\lambda_{1}\|\boldsymbol{u}^{(t)}-\Theta^{\top}\boldsymbol{v}^{(t)}\|^{2}+\lambda_{2}\|\boldsymbol{u}^{(t)}\|^{2}, \\
& \text{s.t.}\quad \Theta\Theta^{\top}=\boldsymbol{I},
\end{aligned}
\tag{29}
$$

where the intercept is omitted in the SVM learner for simplicity. In Eq. (29), the constraint terms effectively manage both task relatedness and model complexity. It is noteworthy that the traditional ASO formulation, represented by Eq. (28), is a special case of iASO, irrespective of the choice of loss function.

To address the non-convex iASO problem (29), observe that, for fixed $\boldsymbol{u}^{(t)}$ and $\Theta$, the constraint terms are minimized by $\boldsymbol{v}^{(t)}=\Theta\boldsymbol{u}^{(t)}$. Substituting this minimizer, the constraint terms summed over all tasks can be restructured as

$$
\boldsymbol{G}(\boldsymbol{U},\Theta)=\lambda_{1}\eta(1+\eta)\,\text{tr}\left(\boldsymbol{U}^{\top}(\eta\boldsymbol{I}+\Theta^{\top}\Theta)^{-1}\boldsymbol{U}\right),
\tag{30}
$$

where $\eta=\lambda_{2}/\lambda_{1}>0$ and $\boldsymbol{U}=[\boldsymbol{u}^{(1)},\cdots,\boldsymbol{u}^{(T)}]$. Indeed, with this choice of $\boldsymbol{v}^{(t)}$ the constraint terms of task $t$ equal ${\boldsymbol{u}^{(t)}}^{\top}\big((\lambda_{1}+\lambda_{2})\boldsymbol{I}-\lambda_{1}\Theta^{\top}\Theta\big)\boldsymbol{u}^{(t)}$, and $(\eta\boldsymbol{I}+\Theta^{\top}\Theta)^{-1}=\frac{1}{\eta}\big(\boldsymbol{I}-\frac{1}{1+\eta}\Theta^{\top}\Theta\big)$ because $\Theta\Theta^{\top}=\boldsymbol{I}$ makes $\Theta^{\top}\Theta$ a projection. Thus, the convex ASO formulation can be written as

$$
\begin{aligned}
& \sum_{t=1}^{T}\frac{1}{N_{t}}\max\{\boldsymbol{0},1-(\boldsymbol{X}^{(t)}\boldsymbol{u}^{(t)})\cdot\boldsymbol{y}^{(t)}\}+\boldsymbol{G}(\boldsymbol{U},\Theta), \\
& \text{s.t.}\quad \Theta\Theta^{\top}=\boldsymbol{I}.
\end{aligned}
\tag{31}
$$

The convex optimization procedure alternates between estimating $\boldsymbol{U}$ with $\Theta^{\top}\Theta$ fixed and estimating $\Theta^{\top}\Theta$ with $\boldsymbol{U}$ fixed. The convergence analysis proves that cASO (31) converges to a global optimum (chen2009convex).

Multi-level Lasso, introduced by lozano2012multi, relies on decomposing the regression coefficients into two components: one shared across all tasks and another designed to capture task-specific features. Specifically, lozano2012multi suppose that the “global” sparsity is controlled by part of the “main effect” variables. Thus, an alternative decomposition is proposed to satisfy the desired property by rewriting $\boldsymbol{w}^{t}$ as

$$
\boldsymbol{w}^{t}_{d}=\theta_{d}\boldsymbol{\gamma}^{(t)}_{d},\quad d=1,\cdots,D,
\tag{32}
$$

where $\theta_{d}$ indicates the “effect” from the $d$-th feature, and $\boldsymbol{\gamma}_{d}^{(t)}$ reflects task specificity. Accordingly, the optimization problem can be written as

$$
\min_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\theta_{d}+\lambda_{2}\sum_{d=1}^{D}\|\boldsymbol{\gamma}_{d}\|_{1},\quad s.t.~\boldsymbol{W}=\vec{\boldsymbol{\theta}}\boldsymbol{\Lambda}\boldsymbol{\Gamma},~\vec{\boldsymbol{\theta}}\geq\boldsymbol{0}.
\tag{33}
$$

This model accommodates variations in support across multiple tasks while preserving common structures. The optimization alternates between solving for $\theta$ and for $\gamma$ while keeping the other fixed, a procedure that is proved to converge in lozano2012multi. The limitation of Multi-level Lasso lies in this alternating procedure. Learning $\gamma$ with $\theta$ fixed is essentially a classical Lasso problem, which is relatively easy to solve. However, obtaining the solution for the global problem can be time-consuming, as pointed out in friedman2007pathwise.
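The effect of the product decomposition in Eq. (32) on the sparsity pattern can be illustrated with a few lines of code. The sketch below only composes weights from given factors and reads off the per-task supports. It does not implement the alternating solver, and the function names and toy values are assumptions made for illustration.

```python
from typing import List, Set

Matrix = List[List[float]]  # D x T: rows index features, columns index tasks


def compose_weights(theta: List[float], gamma: Matrix) -> Matrix:
    """Build W with W[d][t] = theta[d] * gamma[d][t], as in Eq. (32)."""
    if any(th < 0.0 for th in theta):
        raise ValueError("theta must be non-negative, as required in Eq. (33)")
    return [[th * g for g in g_row] for th, g_row in zip(theta, gamma)]


def supports(W: Matrix) -> List[Set[int]]:
    """For each task, the set of feature indices with a non-zero weight."""
    n_tasks = len(W[0])
    return [{d for d, row in enumerate(W) if row[t] != 0.0}
            for t in range(n_tasks)]


# theta[1] = 0 switches feature 1 off for every task ("global" sparsity),
# while gamma[2][0] = 0 switches feature 2 off for task 0 only.
theta = [2.0, 0.0, 1.0]
gamma = [[1.0, -0.5], [3.0, 4.0], [0.0, 2.0]]
W = compose_weights(theta, gamma)
task_supports = supports(W)
```

In this toy case both tasks share feature 0, feature 1 is removed globally, and only task 1 keeps feature 2, so the supports overlap without being identical.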

Remarks (i) Decomposition methods facilitate the learning of additional task relatedness via imposing different regularizations on the weight components from the decomposition. (ii) Regularizations applied to different components can indeed introduce new challenges in the optimization process when solving the problem.

2.1.5. Priori Sharing

Multi-task priori sharing focuses on understanding and exploiting the relationships between different tasks to improve learning efficiency and performance. This approach is predicated on the idea that tasks, especially those that are related, can provide complementary information that enhances learning when approached collectively rather than in isolation. By identifying and leveraging the priori interconnections among tasks, priori sharing aims to achieve better generalization, more robust models, and improved predictions for each task.

The typical formulation of priori sharing in MTL takes the same form as Eq. (4). This objective minimizes a cumulative loss over $T$ tasks, which is the sum of the individual losses of each task’s predictions against its true values, adjusted by a global regularization term. The regularization term $\lambda\Omega(\boldsymbol{W})$ is applied to the combined weights $\boldsymbol{W}$, which concatenate all task-specific weights $\boldsymbol{w}^{(t)}$, thereby incorporating shared information across tasks into the model. It is designed based on a priori knowledge of task interrelations and enforces structural constraints on $\boldsymbol{W}$ that reflect the assumed relationships between tasks. This formulation allows similarities and differences across tasks to inform the learning process, aiming to improve the generalization of the model by leveraging shared patterns and task-specific peculiarities. The categorization of multi-task prior sharing can be broadly understood in the following ways:

Task similarity. There is compelling evidence supporting the advantages of learning from multiple task domains rather than from single-task data. In earlier studies, such as evgeniou2004regularized and parameswaran2010large, the proposed multi-task relationship learning formulations were derived from prior assumptions of task relatedness. Specifically, evgeniou2004regularized and parameswaran2010large assumed that the learning tasks are similar to each other and employed task-coupling parameters to model the target average task. In Regularized MTL (evgeniou2004regularized), task-coupling parameters were utilized to model the relationships between tasks and to extend existing kernel-based single-task methods, such as the support vector machine (SVM), through a novel kernel function. Their formulation is

$$
\begin{aligned}
& \min_{\boldsymbol{w}_{0},\boldsymbol{v}_{t},\xi_{it}}\Big\{\sum_{t=1}^{T}\sum_{i=1}^{m}\xi_{it}+\frac{\lambda_{1}}{T}\sum_{t=1}^{T}\|\boldsymbol{v}_{t}\|_{2}^{2}+\lambda_{2}\|\boldsymbol{w}_{0}\|_{2}^{2}\Big\}, \\
& s.t.\quad y_{it}(\boldsymbol{w}_{0}+\boldsymbol{v}_{t})\cdot\boldsymbol{x}_{it}\geq 1-\xi_{it},\quad \xi_{it}\geq 0,\quad \forall i\in\{1,2,\dots,m\}~\text{and}~t\in\{1,2,\dots,T\},
\end{aligned}
\tag{34}
$$

where $m$ is the number of data points for each task, and $\xi_{it}$ represents the error that the model $\boldsymbol{w}_{0}+\boldsymbol{v}_{t}$ makes on the $i$-th data point of task $t$. They followed the formulation from Hierarchical Bayes (allenby1998marketing; arora1998hierarchical; heskes2000empirical) and described the $T$ target functions as hyperplanes $f_{t}(\boldsymbol{x})=\boldsymbol{w}_{t}\cdot\boldsymbol{x}$, where $\boldsymbol{w}_{t}=\boldsymbol{w}_{0}+\boldsymbol{v}_{t}$ denotes each corresponding target model. The authors assume that, when the tasks are similar to each other, the task-specific deviations $\boldsymbol{v}_{t}$ are small and the tasks are linked through a common model $\boldsymbol{w}_{0}$. Additionally, evgeniou2005learning and kato2007multi use prior information on the similarities between pairs of tasks and incorporate regularization terms that align the distance between model parameters with the distance between tasks. Furthermore, gornitz2011hierarchical describes the relationship between tasks using a tree structure, in which the model parameters at each node learn their similarity from the parent node.

Task correlation. Simply assuming a relationship among tasks without supporting evidence can nevertheless be detrimental and may distort the results. Bayesian models such as bonilla2007multi instead learn task relatedness directly from the data: they define a prior over the unobserved functions of all tasks and adapt the model parameters to the task identities and the observed information without imposing strong modeling assumptions. In particular, they use multi-task Gaussian Process (GP) prediction to model the correlation among tasks, formulated as

$$
\begin{aligned}
&\langle f_{l}(\boldsymbol{x})f_{k}(\boldsymbol{x}^{\top})\rangle=K_{lk}^{f}\,k^{x}\langle\boldsymbol{x},\boldsymbol{x}^{\top}\rangle,\quad y_{il}\sim\mathcal{N}(f_{l}(x_{i}),\sigma_{l}^{2}),\quad l,k\in\{1,\dots,T\},\ i\in\{1,\dots,N\},\\
&\min_{\boldsymbol{\theta}_{X}}\bigg(N\log\big|\langle F^{T}(\boldsymbol{K}^{x}(\boldsymbol{\theta}_{x}))^{-1}F\rangle\big|+T\log\big|\boldsymbol{K}^{x}(\boldsymbol{\theta}_{x})\big|\bigg),
\end{aligned}
\tag{35}
$$

where a GP prior is placed over the latent functions $\{f_{l}\}$ to directly induce correlations between tasks, $\boldsymbol{K}^{f}$ is a positive semi-definite (PSD) matrix that encodes the inter-task dependencies, $k^{x}$ is the covariance between input data points, $\sigma_{l}^{2}$ is the noise variance of the $l$-th task, and $\boldsymbol{F}$ is the vector of function values corresponding to $\boldsymbol{Y}$. bonilla2007multi combines a common covariance function over the input features with a 'free-form' covariance matrix over tasks, which offers significant flexibility in modeling diverse forms of data and task relationships. Furthermore, this 'free-form' covariance matrix reduces the need for extensive observed data, which improves the efficiency of the method. To address the overfitting that stems from the point estimation used in bonilla2007multi, zhang2010multi extended multi-task GP to a weight-space view of the multi-task $t$ process and placed an inverse-Wishart prior on the covariance matrix, which makes the method more robust.
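As an illustration of the covariance structure described here, the following minimal sketch builds the joint covariance of all task outputs. It is not the implementation of bonilla2007multi: the squared-exponential form of $k^{x}$, the helper names `rbf_kernel` and `multitask_gp_cov`, the Cholesky-style factor `L` used to obtain a PSD $\boldsymbol{K}^{f}$, and the assumption that every task is observed at the same $N$ inputs are all choices made for this example. Under these assumptions, and with outputs stacked task by task, the covariance is the Kronecker product $\boldsymbol{K}^{f}\otimes\boldsymbol{K}^{x}$ plus a per-task noise term.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential input kernel k^x (an assumed choice for this sketch)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)

def multitask_gp_cov(X, L, noise_var):
    """Joint covariance of the stacked outputs, ordered task by task.

    X: (N, d) inputs shared by all tasks (assumption of this sketch)
    L: (T, T) lower-triangular factor, so K^f = L L^T is PSD by construction
    noise_var: (T,) per-task noise variances sigma_l^2
    """
    N = X.shape[0]
    Kf = L @ L.T                      # free-form task covariance K^f
    Kx = rbf_kernel(X)                # input covariance K^x
    return np.kron(Kf, Kx) + np.kron(np.diag(noise_var), np.eye(N))

rng = np.random.default_rng(0)
N, d, T = 5, 2, 3
X = rng.normal(size=(N, d))
L = np.tril(rng.normal(size=(T, T)))
noise_var = np.array([0.1, 0.2, 0.3])
Sigma = multitask_gp_cov(X, L, noise_var)   # shape (T * N, T * N)
```

Block $(l,k)$ of `Sigma` equals $K_{lk}^{f}\boldsymbol{K}^{x}$ for $l\neq k$, which is how observations of one task inform predictions for another.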

Task covariance. In addition to learning through task correlation and task similarities, zhang2012convex; zhang2014regularization introduced Multi-Task Relationship Learning (MTRL), which uses a task covariance matrix to capture task relatedness. Within the regularization framework, they derived a convex formulation for multi-task learning that learns the model parameters and the task relationships simultaneously. Their innovation lies in placing a matrix-variate normal prior on the weight matrix $\boldsymbol{W}$; combined with a suitable likelihood function, this structured prior leads to an objective function whose minimizer is the maximum a posteriori solution. The objective function they employed is

$$
\begin{aligned}
\min_{\boldsymbol{W},\boldsymbol{\Omega}}\quad&\mathcal{L}(\boldsymbol{W})+\lambda_{1}\|\boldsymbol{W}\|_{F}^{2}+\lambda_{2}\,\text{tr}(\boldsymbol{W}\boldsymbol{\Omega}^{-1}\boldsymbol{W}^{T})\\
\text{s.t.}\quad&\boldsymbol{\Omega}\succ 0,\ \text{tr}(\boldsymbol{\Omega})\leq 1,
\end{aligned}
\tag{36}
$$

where the objective minimizes a loss function $\mathcal{L}(\boldsymbol{W})$ augmented by a regularization term, scaled by $\lambda_{1}$, that penalizes the Frobenius norm of $\boldsymbol{W}$, and an additional term, scaled by $\lambda_{2}$, involving the trace of $\boldsymbol{W}\boldsymbol{\Omega}^{-1}\boldsymbol{W}^{T}$, which reflects the matrix-variate normal prior. Here, $\boldsymbol{\Omega}$ denotes a positive definite matrix capturing task covariance, and its complexity is controlled through the constraints on its positive definiteness and bounded trace. This formulation has been established as jointly convex in $\boldsymbol{W}$ and $\boldsymbol{\Omega}$, allowing the model parameters and the task covariance matrix to be optimized simultaneously.
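The two regularizers in Eq. (36) can be evaluated numerically with the short sketch below. It is a hedged illustration rather than the authors' code: the function names are hypothetical, $\boldsymbol{W}$ is assumed to have shape $(d, T)$ with one column per task, and the $\boldsymbol{\Omega}$ step uses the standard closed-form minimizer of the trace term under a unit-trace constraint, $\boldsymbol{\Omega}=(\boldsymbol{W}^{T}\boldsymbol{W})^{1/2}/\text{tr}((\boldsymbol{W}^{T}\boldsymbol{W})^{1/2})$, which is stated here as an assumption about how an alternating scheme would update $\boldsymbol{\Omega}$ for fixed $\boldsymbol{W}$.

```python
import numpy as np

def mtrl_penalty(W, Omega, lam1, lam2):
    """lam1 * ||W||_F^2 + lam2 * tr(W Omega^{-1} W^T), with W of shape (d, T)."""
    trace_term = np.trace(W @ np.linalg.solve(Omega, W.T))
    return lam1 * np.sum(W**2) + lam2 * trace_term

def omega_update(W, eps=1e-10):
    """Closed-form Omega for fixed W under tr(Omega) = 1 (assumed update rule)."""
    evals, evecs = np.linalg.eigh(W.T @ W)
    # Matrix square root of W^T W via its eigendecomposition.
    root = (evecs * np.sqrt(np.clip(evals, eps, None))) @ evecs.T
    return root / np.trace(root)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3))            # d = 6 features, T = 3 tasks
Omega_star = omega_update(W)
best = mtrl_penalty(W, Omega_star, lam1=0.0, lam2=1.0)
```

At this $\boldsymbol{\Omega}$ the trace term equals the squared nuclear norm of $\boldsymbol{W}$, and the off-diagonal entries of `Omega_star` can be read as learned task covariances.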

In essence, their approach extends regularized single-task learning and relies on an alternating optimization procedure to solve the resulting convex objective. Further developments have extended this framework to multi-task boosting (zhang2012multi) and multi-label learning (zhang2013multilabel), illustrating its adaptability to a broad spectrum of applications. The approach can also be interpreted from the viewpoint of reproducing kernel Hilbert spaces for vector-valued functions (ciliberto2015learning; jawanpuria2015efficient). Moreover, when the number of tasks is large, not all tasks are equally interrelated, and the inter-task relationships tend to be sparse. Because a task may not contribute meaningfully to every other task, and because sparse task relationships can mitigate overfitting more effectively than dense ones, there is growing interest in models that capture these sparse patterns. zhang2017learning focuses on learning such sparse task relationships, and the objective function can be written as

$$
\min_{\boldsymbol{W},\boldsymbol{\Omega}\geq 0}\ \sum_{t=1}^{T}\frac{1}{N_{t}}\sum_{j=1}^{N_{t}}\mathcal{L}(\boldsymbol{w}_{t}^{\top}\boldsymbol{\phi}(x_{j}),y_{j})+\frac{\lambda_{1}}{2}\text{tr}(\boldsymbol{W}\boldsymbol{\Omega}^{-1}\boldsymbol{W}^{\top})+\lambda_{2}\|\boldsymbol{\Omega}\|_{1}, \tag{37}
$$

where $\boldsymbol{\phi}(\cdot)$ is the feature mapping and the model for task $t$ is $f_{t}(\boldsymbol{x})=\boldsymbol{w}_{t}^{\top}\boldsymbol{\phi}(\boldsymbol{x})$. By adding an $l_{1}$ regularization, known for promoting sparsity, on the covariance matrix $\boldsymbol{\Omega}$, their approach, termed the SParse covAriance based mulTi-taSk (SPATS) model, learns a sparse task covariance structure within a regularization framework tailored for MTL. The convexity of the SPATS objective function facilitates an efficient alternating optimization strategy for finding the solution.

Remarks (i) In environments where tasks are interdependent and data is limited or imbalanced, the ability to discern and exploit the latent task interrelations becomes crucial. (ii) Overestimating task similarity can lead to negative transfer, where learning one task may adversely affect the performance of another. Task similarities might change dynamically during training, requiring adaptive models that can adjust to these changes. (iii) Models that heavily rely on task covariances are at risk of overfitting to the specific relations present in the training data, reducing their generalization capabilities.

2.1.6. Task Clustering/Grouping

Task relationships can be elucidated through the clustering or grouping of associated tasks, whereby tasks within the same cluster exhibit greater similarities. Clustering at the task level is particularly advantageous in scenarios with numerous tasks. Typically, task clustering leverages structural information shared across tasks, such as task similarity or distance. Such approaches are termed horizontal methods, in contrast to hierarchical methods, which exploit inherent task structures such as trees to achieve MTL. Task priori sharing and clustering are closely related, since both exploit what tasks have in common; the difference is that the cluster structure is not known a priori and must be learned. For example, the problem defined in Eq. (34) is equivalent to the following optimization problem (see the proof in evgeniou2004regularized):

$$
\begin{aligned}
\min_{\boldsymbol{w}_{t},\xi_{it}}\quad&\bigg\{\sum_{t=1}^{T}\sum_{i=1}^{m}\xi_{it}+\frac{\lambda_{1}\lambda_{2}}{T(\lambda_{1}+\lambda_{2})}\sum_{t=1}^{T}\|\boldsymbol{w}_{t}\|^{2}+\frac{\lambda_{1}^{2}}{T(\lambda_{1}+\lambda_{2})}\sum_{t=1}^{T}\Big\|\boldsymbol{w}_{t}-\frac{1}{T}\sum_{s=1}^{T}\boldsymbol{w}_{s}\Big\|^{2}\bigg\},\\
\text{s.t.}\quad&y_{it}\cdot\boldsymbol{w}_{t}\cdot\boldsymbol{x}_{it}\geq 1-\xi_{it},\quad\xi_{it}\geq 0,
\end{aligned}
\tag{38}
$$

where $\boldsymbol{w}_{t}=\boldsymbol{w}_{0}+\boldsymbol{v}_{t}$ (see Eq. (34)). The second regularization term in Eq. (38) pulls every task model toward the mean of all task models, which corresponds to the special case in which all tasks are clustered into a single group and their parameters are constrained to be as similar as possible. In practice, however, related tasks often fall into several different groups.
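To make the role of the two regularizers in Eq. (38) concrete, the sketch below evaluates them for a given set of task weights. It is only an illustration under stated assumptions: the function names are hypothetical, the weights are stored as a $(T, d)$ array with one row per task, and the slack variables are set to their smallest feasible values, i.e. the hinge loss.

```python
import numpy as np

def mean_regularized_penalty(W, lam1, lam2):
    """Return the two regularization terms of Eq. (38); W has shape (T, d)."""
    T = W.shape[0]
    denom = T * (lam1 + lam2)
    norm_term = (lam1 * lam2 / denom) * np.sum(W**2)
    w_bar = W.mean(axis=0)                       # average model over all tasks
    spread_term = (lam1**2 / denom) * np.sum((W - w_bar) ** 2)
    return norm_term, spread_term

def hinge_slack(W, X, y):
    """Smallest feasible slacks xi_it = max(0, 1 - y_it * w_t . x_it).

    X: (T, m, d) inputs, y: (T, m) labels in {-1, +1}.
    """
    margins = y * np.einsum("td,tmd->tm", W, X)
    return np.maximum(0.0, 1.0 - margins)

W = np.array([[1.0, 0.0], [-1.0, 0.0]])          # two tasks with opposite models
norm_term, spread_term = mean_regularized_penalty(W, lam1=1.0, lam2=1.0)
```

When all rows of `W` coincide, the second term vanishes, which is the single-cluster situation described above.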

Horizontal Methods

Clustered Multi-Task Learning (CMTL) (zhou2011clustered) assumes that tasks in the same cluster are similar to each other, and it sheds light on the inherent relationship between ASO (ando2005framework) and CMTL. Specifically, the CMTL formulation is non-convex, and its proposed convex relaxation is equivalent to an existing convex relaxation of ASO. The objective function of CMTL can be formulated as

$$
\begin{aligned}
\min_{\boldsymbol{W},\boldsymbol{F}}\quad&\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|\boldsymbol{X}^{(t)}\boldsymbol{w}^{t}-\boldsymbol{y}^{t}\|_{2}^{2}+\lambda_{1}\big(\text{tr}(\boldsymbol{W}^{\top}\boldsymbol{W})-\text{tr}(\boldsymbol{F}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{F})\big)+\lambda_{2}\sum_{t=1}^{T}\|\boldsymbol{w}^{t}\|_{2}^{2},\\
\text{s.t.}\quad&\boldsymbol{F}_{t,j}=1/\sqrt{n_{j}}~\text{if}~t\in\mathcal{C}_{j},~\text{otherwise}~0,\quad t=1,\cdots,T,
\end{aligned}
\tag{39}
$$

where $n_{j}$ is the number of tasks in the $j$-th cluster $\mathcal{C}_{j}$.
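The clustering term in Eq. (39) has a k-means reading that a small numerical check can illustrate. The sketch below is not taken from zhou2011clustered: the helper names are hypothetical, $\boldsymbol{W}$ is assumed to have shape $(d, T)$ with one column per task, cluster labels are assumed to be given, and every cluster is assumed to be non-empty. Under these assumptions, $\text{tr}(\boldsymbol{W}^{\top}\boldsymbol{W})-\text{tr}(\boldsymbol{F}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{F})$ equals the sum of squared distances of each task's weights to its cluster mean.

```python
import numpy as np

def cluster_indicator(labels, k):
    """Indicator F of shape (T, k): F[t, j] = 1/sqrt(n_j) if task t is in cluster j."""
    F = np.zeros((len(labels), k))
    for j in range(k):
        members = np.flatnonzero(labels == j)    # assumes the cluster is non-empty
        F[members, j] = 1.0 / np.sqrt(len(members))
    return F

def cmtl_cluster_penalty(W, F):
    """tr(W^T W) - tr(F^T W^T W F), with W of shape (d, T)."""
    G = W.T @ W
    return np.trace(G) - np.trace(F.T @ G @ F)

def within_cluster_scatter(W, labels, k):
    """Sum of squared distances of task weights to their cluster mean."""
    total = 0.0
    for j in range(k):
        Wj = W[:, labels == j]
        total += np.sum((Wj - Wj.mean(axis=1, keepdims=True)) ** 2)
    return total

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 1, 2])            # T = 6 tasks in k = 3 clusters
W = rng.normal(size=(4, 6))
F = cluster_indicator(labels, k=3)
penalty = cmtl_cluster_penalty(W, F)
scatter = within_cluster_scatter(W, labels, k=3)
```

The two quantities `penalty` and `scatter` agree up to floating-point error, so minimizing the $\lambda_{1}$ term encourages tasks in the same cluster to share similar weights.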

Hierarchical Methods

The TAsk Tree (TAT) model (han2015learning) is the first MTL method to learn a tree structure under the regularization framework. Given the number of tree layers $H$, han2015learning uses matrix decomposition to learn model weights for each layer, i.e., $\{\boldsymbol{W}_{h}\}_{h=1}^{H}$. TAT imposes sequential constraints on the distances between consecutive weight matrices across tree layers. Combining these with the loss functions gives the learning objective:

$$
\begin{aligned}
\min_{\boldsymbol{W}}\quad&\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\Big\|\boldsymbol{X}^{(t)}\sum_{h=1}^{H}\boldsymbol{w}_{h}^{t}-\boldsymbol{y}^{t}\Big\|_{2}^{2}+\sum_{h=1}^{H}\lambda_{h}\sum_{i<j}^{T}\|\boldsymbol{w}_{h}^{i}-\boldsymbol{w}_{h}^{j}\|_{2}^{2},\\
\text{s.t.}\quad&|\boldsymbol{w}_{h-1}^{i}-\boldsymbol{w}_{h-1}^{j}|\succeq|\boldsymbol{w}_{h}^{i}-\boldsymbol{w}_{h}^{j}|,\quad\forall h\geq 2,\ \forall i<j,
\end{aligned}
\tag{40}
$$

where the hyperparameters $\{\lambda_{h}\}_{h=1}^{H}$ indicate the importance of the different tree layers, and $|\cdot|$ and $\succeq$ denote elementwise operations. This sequential constraint encourages a non-increasing order for the pairwise distance between tasks from bottom to top.
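A compact way to read Eq. (40) is to evaluate the objective and check the sequential constraint for a given set of layer weights. The sketch below does only that and does not reproduce the optimization procedure of han2015learning; the function names and the array layout (`W_layers` of shape $(H, d, T)$, with layer index 0 standing for $h=1$) are assumptions made for this example.

```python
import numpy as np
from itertools import combinations

def tat_objective(W_layers, Xs, ys, lams):
    """Value of the objective in Eq. (40).

    W_layers: (H, d, T) layer-wise weights; Xs[t]: (N_t, d); ys[t]: (N_t,); lams: (H,).
    """
    H, _, T = W_layers.shape
    W_total = W_layers.sum(axis=0)               # column t is sum_h w_h^t
    loss = 0.0
    for t in range(T):
        resid = Xs[t] @ W_total[:, t] - ys[t]
        loss += 0.5 * np.sum(resid**2) / len(ys[t])
    penalty = 0.0
    for h in range(H):
        for i, j in combinations(range(T), 2):
            diff = W_layers[h, :, i] - W_layers[h, :, j]
            penalty += lams[h] * np.sum(diff**2)
    return loss + penalty

def satisfies_sequential_constraint(W_layers, tol=1e-12):
    """Check |w_{h-1}^i - w_{h-1}^j| >= |w_h^i - w_h^j| elementwise for all h >= 2, i < j."""
    H, _, T = W_layers.shape
    for h in range(1, H):
        for i, j in combinations(range(T), 2):
            prev = np.abs(W_layers[h - 1, :, i] - W_layers[h - 1, :, j])
            curr = np.abs(W_layers[h, :, i] - W_layers[h, :, j])
            if np.any(prev + tol < curr):
                return False
    return True

W_layers = np.array([[[0.0, 2.0]], [[0.0, 1.0]]])   # H = 2, d = 1, T = 2
Xs = [np.array([[1.0]]), np.array([[1.0]])]
ys = [np.array([0.0]), np.array([3.0])]
value = tat_objective(W_layers, Xs, ys, lams=np.array([1.0, 1.0]))
feasible = satisfies_sequential_constraint(W_layers)
```

In this toy setting the two tasks differ more at the first layer than at the second, so the constraint holds; swapping the layers violates it.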

Remarks (i) Task clustering methods are scalable with respect to the number of tasks in MTL. (ii) Both clustering and priori sharing methods in MTL carry similar underlying meanings as they inherently decipher task relationships. (iii) Task clustering complements other MTL strategies, as any MTL approach can be implemented within the task clusters. (iv) Solutions in this section tend to be suboptimal, given that task clustering is not exclusive.
Table 5. Summary of deep MTL models.
Model Name Origin Year MTL Strategy Backbone Sharing Modality Task Measurement Loss Function Availability1
TCDCN ECCV zhang2014facial Early stopping CNN Hard Image Facial landmark detection/head pose estimation/ Mean error (mErr) (burgos2013robust), Mean squared error (MSE),
gender classification/age estimation/expression failure rate (dantone2012real) cross-entropy (CE) loss Official
recognition/facial attribute inference
ACL-
MTL-ML IJCNLP dong2015multi RNN Hard Text Multiple-target language translation BLEU-4 (papineni2002bleu), Delta CE loss
Vanilla Part-Of-Speech (POS)/Chunking/Combinatory
Cascading ACL sogaard-goldberg-2016-deep Cascading LSTM Hard Text Categorical Grammar (CCG) Supertagging F1 score, Micro-F1 score CE loss
Surface normals estimation (normals)/semantic mErr/median error (medErr)/within tsuperscript𝑡t^{\circ}italic_t start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT in angular
Cross-stitch segmentation (semseg), object detection/attribute distance (within tsuperscript𝑡t^{\circ}italic_t start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT), pixel accuracy (pixacc),
networks CVPR misra2016cross CNN Soft Image prediction mIoU, fwIU, mAP CE loss Unofficial
ASP-MTL (aka Hard & CE loss, adversarial loss,
AdvMTL) arXiv liu2017adversarial Adversarial training LSTM Soft Text Text classifications Error rate orthogonality constraint Official
Cascading, adding Part-Of-Speech (POS) tagging/chunking/parsing/ Accuracy (acc), F1, MSE, unlabeled attachment CE loss, softmax loss,
JMT EMNLP hashimoto-jmt:2017:EMNLP2017 constraints LSTM Soft Text semantic relatedness/textual entailment score (UAS)/labeled attachment score (LAS) KL-divergence Unofficial
Object detection/mask estimation/object Mask regression loss,
MNCs CVPR dai2016instance Cascading CNN Hard Image categorization mAP@@@@IoU softmax loss Official
FAFS CVPR lu2017fully NAS CNN Hard Image person attribute classification Acc/recall CE loss Official
Hard &
MRN NeurIPS long2017learning Task conditioning CNN Soft Image classifications on different domains Acc CE loss Official
Depth/scene parsing/contour rel (eigen2014depth)/RMSE/log10 mErr/ CE loss, softmax loss
PAD-Net CVPR xu2018pad Mutual distillation CNN Hard Image prediction/normals acc with threshold δ𝛿\deltaitalic_δ (acc-δ𝛿\deltaitalic_δ), IoU/acc Euclidean loss
MTA(adv)subscriptAadv\text{A}_{(\text{adv})}A start_POSTSUBSCRIPT ( adv ) end_POSTSUBSCRIPTN CVPR liu2018multi Adversarial training CNN Hard Image font/glyph, identity/pose/illumination Recognition rate CE loss, adversarial loss
cross-task rel/ berHu loss (laina2016deeper),
TRL ECCV zhang2018joint attention CNN Hard Image Depth estimation (depth)/semseg RMSE/acc-δ𝛿\deltaitalic_δ, pixacc/mean acc/mIoU CE loss, uncertainty loss
MMoE KDD ma2018modeling MoE MLP Hard & Tabular Income/education/marriage prediction, Area Under the Curve (AUC) CE loss Unofficial
soft data engagement/satisfaction in recommendation
Tabular
Soft Order ICLR meyerson2018beyond feature fusion CNN, MLP Soft data, image Classification, attribute recognition mErr CE loss
classification/colorization/edge/denoised
GREAT4MTL arXiv sinha2018gradient adversarial training CNN Hard Image reconstruction, depth/normal/keypoint Err, RMSE, 1|cos(,)|1cos1-|\text{cos}(\cdot,\cdot)|1 - | cos ( ⋅ , ⋅ ) | CE loss
Sluice Adding constraints, Hard & Chunking/entity recognition (NER)/semantic
networks AAAI ruder2019latent early stopping LSTM Soft Text role labeling (SRL)/POS tagging Acc CE loss Official
CNN, NER/Entity Mention Detection (EMD)/Relation F1 score/precision/recall, MUC/B3/CEAFe
HMTL AAAI sanh2019hierarchical cascading LSTM Hard Text Extraction (RE)/Coreference Resolution (CR) (moosavi2016coreference) CE loss Unofficial
CNN, Segment labeling/Named Entity Labeling CRF loss, CE loss, ranking
DCMTL AAAI gong2019deep cascading LSTM Hard Text (NEL)/slot filling F1 score/precision/recall loss (vu2016bi) Official
Normals/semseg, age estimation/gender mErr/medErr/within tsuperscript𝑡t^{\circ}italic_t start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT, mIoU, pixacc, mean/
NDDR-CNN CVPR gao2019nddr feature fusion CNN Soft Image classification median absolute error (absErr), acc CE loss Official
cross-task RMSE/rel/acc with t𝑡titalic_t, mErr/medError/within CE loss, 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss, berHu loss
PAP CVPR zhang2019pattern attention CNN Hard Image Semseg/depth/normals tsuperscript𝑡t^{\circ}italic_t start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT, mIoU/mean accuracy (mAcc)/pixacc affinity loss (zhang2019pattern)
MTA(atten)subscriptAatten\text{A}_{(\text{atten})}A start_POSTSUBSCRIPT ( atten ) end_POSTSUBSCRIPTN Hard & Semseg/depth/normals, 10 classifications mIoU/pixacc, mErr/medErr/within tsuperscript𝑡t^{\circ}italic_t start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT,
(& DWA) CVPR liu2019end Adaptive weighting CNN Soft Image (visual domain decathlon2   ) absErr/real error, accuracy CE loss, 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss, dot product Official
Semseg/depth/edge/normals/human parts/ mIoU/osdF/mErr/maximum F-measure (maxF)/
ASTMT CVPR maninis2019attentive attention, single-tasking CNN Hard Image saliency estimation/albedo RMSE/ΔmsubscriptΔ𝑚\Delta_{m}roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT CE loss, 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss Official
ML-GCN CVPR chen2019multi Graph based CNN, GCN Hard Image Multi-label recognition precision, recall, F1 CE loss Official
RD4MTL arXiv meng2019representation Adversarial training CNN Hard Image Classifications Acc CE loss, adversarial loss Official
MTL-NAS CVPR gao2020mtl NAS CNN Adaptive Image Semseg/normals, object classification/scene mErr/medErr/Within tsuperscript𝑡t^{\circ}italic_t start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT, mIoU/pixacc, CE loss, 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT loss Official
classification Acc
Semseg/edge/depth/keypoint detection (point),
BMTN BMVC vandenhende2019branched NAS CNN Adaptive Image attribute classification mIoU, pixacc, 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, Acc CE loss, 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss, ΔmsubscriptΔ𝑚\Delta_{m}roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT Official
PSD | CVPR | zhou2020pattern | Distillation | CNN | Hard & soft | Image | Semseg/depth/normals | RMSE/rel/acc with $t$, mIoU/mean accuracy/pixacc, mErr/medErr/within $t^{\circ}$ | CE loss, $\ell_1$ loss, berHu loss | -
KD4MTL | ECCV Workshop | li2020knowledge | knowledge distillation | CNN | Hard & soft | Image | Semseg/depth/normals, classification | mIoU/pixacc, absErr/rel, mErr/medErr/within $t^{\circ}$, Acc | CE loss, $\ell_1$ loss, dot product | Official
MTI-Net | ECCV | vandenhende2020mti | multi-task distillation | CNN | Hard & soft | Image | Semseg/depth/edges detection (edges)/normals/saliency estimation/human parts | mIoU, RMSE, mErr, optimal dataset-scale F-measure (odsF) (martin2004learning), $\Delta_m$ | CE loss, $\ell_1$ loss | Official
LTB | ICML | guo2020learning | NAS, task grouping | CNN | Soft | Image | Regression, face attribute prediction, semseg/normals/depth/keypoints/edges | Acc, CE, cos, mean absErr | CE loss, $\ell_1$ loss, cosine loss | -
AAMTRL | ICML | mao2020adaptive | adversarial training | CNN & LSTM | Hard | Text | Classifications | Relatedness evolution, acc, influence of #task | Any 1-Lipschitz loss | -
CGC & PLE | RecSys | tang2020progressive | MoE | MLP | Hard & soft | Tabular data | Sub-tasks in the recommendation systems, income/education/marriage prediction | AUC/MSE, MTL gain | CE loss, $\ell_2$ loss | Unofficial
TSNs | ICCV | sun2021task | task relationship learning, task conditioning | CNN | Hard | Image | Semseg/depth/edges/normals/saliency estimation/human parts | mIoU, RMSE, mErr, odsF, $\Delta_m$ | CE loss, $\ell_1$ loss | Official
MuST | ICCV | ghiasi2021multi | knowledge distillation, task conditioning | CNN | Hard | Image | Classification/detection/semseg/depth/normals | Acc, mIoU, RMSE, odsF | CE loss, $\ell_1$ loss | -
AuxSegNet | ICCV | xu2021leveraging | cross-task attention | CNN | Hard & soft | Image | Semseg/classification/saliency detection | mIoU/precision/recall | Multi-label softmax loss, CE loss | Official
ATRC | ICCV | bruggemann2021exploring | cross-task attention | CNN | Hard & soft | Image | Semseg/depth estimation/edges/normals/saliency estimation/human parts | mIoU, RMSE, mErr, odsF, maxF, $\Delta_m$ | CE loss, $\ell_1$ loss | Official
DSelect-k | NeurIPS | hazimeh2021dselect | MoE | MLP, CNN | Hard & soft | Tabular data, Image | engagement/satisfaction task, classification | Total loss, Acc, AUC/RMSE, #expert | CE loss, $\ell_2$ loss | Official
MT-TaG | ArXiv | gupta2022sparsely | MoE | Transformer | Hard & soft | Text | 16 language understanding tasks, e.g. textual entailment, sentiment classification, etc. | Acc, Spearman correlation (spearman1961proof), Matthews correlation coefficient (matthews1975comparison) | CE loss, MSE | -
CrossDistil | AAAI | yang2022cross | distillation | MLP | Hard & soft | Tabular data | Finish watching/like | AUC, multi-AUC (hand2001simple) | CE loss | -
MulT | CVPR | bhattacharjee2022mult | cross-task attention | CNN & Transformer | Hard | Image | Semseg/depth/reshading/normals/keypoints/edges | MTL gain, mErr of domain generalization | CE loss, $\ell_1$ loss, rotate loss (zamir2018taskonomy) | Official
MTFormer | ECCV | xu2022mtformer | cross-task attention, task balancing (spec., kendall2018multi) | Transformer | Hard & soft | Image | Semseg/depth/saliency detection/human parts | mIoU, RMSE, $\Delta_m$ | CE loss, $\ell_1$ loss, cross-task contrastive loss, uncertainty loss | -
MQTransformer | arXiv | xu2022multi | cross-task attention | Transformer | Hard & soft | Image | Semseg/depth/edges/normals/saliency estimation/human parts | mIoU, RMSE, mErr, odsF, maxF | CE loss, $\ell_1$ loss | -
MetaLink | ICLR | cao2022relational | Graph based | MLP, GNN | Hard | Image, Graph | Classification | mAP, ROC AUC | CE loss | Official
DeMT | AAAI | zhang2023demt | cross-task attention | CNN & Transformer | Hard & soft | Image | Semseg/depth/edges/normals/saliency estimation/human parts | mIoU, RMSE, mErr, odsF, maxF, $\Delta_m$ | CE loss, $\ell_1$ loss | Official
mTEB | WACV | lopes2023cross | cross-task attention | CNN | Hard & soft | Image | Semseg/depth/normals/edges | $\Delta_m$, mIoU, RMSE, mErr, F1 | CE loss, berHu loss, cosine loss (guizilini2021geometric) | Official
OKD-MTL | WACV | jacob2023online | distillation, task weighting | Transformer | Hard & soft | Image | Semseg/depth/normals | $\Delta_p$, mIoU/pixacc, absErr/rel, mErr/medErr/within $t^{\circ}$ | Adaptive feature distillation loss, CE loss, $\ell_1$ loss, cosine loss | -
AdaMV-MoE | ICCV | chen2023adamv | MoE | Transformer | Hard & soft | Image | classification/detection/Seg | Acc, Average Precision (AP) | CE loss | Official
  • 1

    This column provides the link to the implementation or execution. Click on "Official" or "Unofficial" to access the website.

  • 2

    Part of PASCAL in Detail Workshop Challenge, CVPR 2017, July 26th, Honolulu, Hawaii, USA. https://www.robots.ox.ac.uk/~vgg/decathlon.

  • 3

    We use "state" here to represent the domain of reinforcement learning, including the observations of the environment's states, the positions of objects, the actions taken by the agent, etc.

  • 4

    The average rank of MTL on all different tasks. MR = 1 if a method ranks first across all tasks.

2.2. DL Era: Effective and Diversified

With the advent of DL, more powerful computational units with higher memory bandwidth, e.g., Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), have made it possible to learn richer features for challenging tasks. Unlike traditional MTL methods, which impose parameter regularizations or decompositions, deep MTL methods can handle large-scale parameter sharing, feature propagation, NAS, task balancing, and optimization intervention, to name a few. Traditional techniques often involve complicated mathematical analysis yet fail to achieve satisfactory performance in real-world scenarios with noisy data or loosely related tasks. Deep MTL methods can overcome these issues by (1) extracting features directly from raw data and gradually elevating them layer by layer, from low-level textures to mid-level semantics to high-level responses; and (2) progressively learning activations via stochastic gradient descent (SGD) (robbins1951stochastic; lecun2002efficient), which is provably efficient and practical for obtaining expressive networks (livni2014computational). In this manner, hierarchical features can be efficiently communicated at different levels for the joint learning of multi-task objectives.

This section begins with the architecture taxonomy commonly adopted in deep MTL, which serves as the backbone for the rest of the method overview. We then summarize feature propagation techniques, including feature fusion (see § 2.2.1), cascading (see § 2.2.2), distillation (see § 2.2.3), and cross-task attention (see § 2.2.4). These techniques encourage networks to automatically combine the features learned from different tasks, addressing the crucial challenge of effectively and efficiently utilizing the rich features enabled by DL. § 2.2.5 presents an overview of task balancing techniques in deep MTL, which linearly combine different tasks based on three essential factors: gradient, loss, and learning speed. Comparing and recalibrating these factors helps coordinate diverse tasks during model weight updates. We discuss these techniques from the perspectives of gradient correction and dynamic weighting. In contrast, § 2.2.6 explores MOO in the context of MTL, which aims to simultaneously optimize potentially conflicting objective functions. Other promising topics covered include adversarial multi-task training (see § 2.2.7), MoE (see § 2.2.8), GCN-based MTL (see § 2.2.9), and NAS for MTL (see § 2.2.10). A summary of deep MTL models is presented in Table 5, and representative DL frameworks in MTL are illustrated in Fig. 8.

[Figure 8 panels: (a) Cross-stitch unit. (b) Sluice block. (c) NDDR unit. (d) Soft Order. (e) KD4MTL pipeline. (f) MuST pipeline. (g) OKD-MTL pipeline. (h) CrossDistil pipeline. (i) PAD module. (j) MTAN module. (k) MTI-Net. (l) PAP module. (m) PSD module. (n) ASTMT. (o) ATRC module. (p) DeMT block. (q) FAFS. (r) BMTN. (s) MTL-NAS module. (t) ASP-MTL. (u) MTA$_{\text{(adv)}}$N. (v) RD4MTL. (w) GREAT4MTL. (x) ML-GCN. (y) MetaLink.]
Figure 8. Frameworks of deep learning techniques used in MTL. (a–d) Feature fusion: cross-stitch networks, Sluice Network, NDDR-CNN, and Soft Order. (e–h) Knowledge distillation: KD4MTL, MuST, OKD-MTL, and CrossDistil. (i–p) Attention: PAD, MTAN, MTI-Net, PAP, PSD, ASTMT, ATRC, and DeMT. (q–s) NAS: FAFS, BMTN, and MTL-NAS. (t–w) Adversarial MTL: ASP-MTL, MTAN, RD4MTL, and AAMTRL. (x–y) Graph: ML-GCN and MetaLink.
Figure 9. Architecture taxonomy proposed by ruder2017overview for deep multi-task sharing: (a) hard parameter sharing, (b) soft parameter sharing, and (c) adaptive sharing. The 1D arrows indicate computations within the neural networks involving learnable parameters. The 2D shapes and 3D cubes represent the final responses and extracted features, respectively.

Architecture Taxonomy

The remarkable success of deep MTL can be attributed to the rich representations it extracts and to their efficient sharing. Multi-task sharing depends on how the architecture is split among the involved tasks. liu2016recurrent first discuss three sharing mechanisms for text classification with Recurrent Neural Networks (RNNs): uniform-, coupled-, and shared-layer architectures. ruder2017overview are the first to organize these mechanisms into two categories: hard parameter sharing and soft parameter sharing. According to this taxonomy, the uniform-layer architecture falls under hard parameter sharing, while the coupled- and shared-layer architectures are considered soft parameter sharing. In general, ruder2017overview's taxonomy has been widely accepted by the research community (vandenhende2021multi). We carry forward this taxonomy and enrich it with more details.

In hard parameter sharing, as shown in Fig. 9(a), different tasks share identical parameters in the shallow layers and maintain their own parameters in the task-specific heads. As shown in Fig. 6(a), this idea dates back to the 1990s (bromley1993signature; caruanamultitask; caruana1997multitask), when highly related tasks were introduced into a shared FNN to serve as inductive bias for each other. Fig. 6(b) shows a more modern use of this idea in RNNs (dong2015multi). CNNs can also adopt hard parameter sharing to perform multiple related tasks. As shown in Fig. 10, TCDCN (zhang2014facial) and Fast R-CNN (girshick2014rich; girshick2015fast) are among the earliest applications of this idea in computer vision. From a representation learning perspective, shallow layers are typically shared as a feature encoder that extracts common features such as edges and textures. By enriching these common features with more related tasks, deeper layers can help enable multitasking on task-specific heads.
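As a minimal sketch of hard parameter sharing (not the implementation of any cited method), the NumPy snippet below passes a batch through one shared layer and two task-specific heads. The layer sizes, the ReLU activation, the two task names, and the identifiers `W_shared`, `heads`, and `forward` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared shallow layer: its parameters are identical for every task.
W_shared = rng.normal(size=(8, 16))

# Task-specific heads: each task keeps its own parameters (hypothetical tasks).
heads = {
    "classification": rng.normal(size=(16, 3)),  # 3 class logits
    "regression": rng.normal(size=(16, 1)),      # 1 scalar target
}

def forward(x):
    """Return one output per task from a single shared feature extractor."""
    h = np.maximum(x @ W_shared, 0.0)            # shared features (ReLU)
    return {task: h @ W_head for task, W_head in heads.items()}

x = rng.normal(size=(5, 8))                      # batch of 5 inputs, 8 features
outputs = forward(x)
for task, y in outputs.items():
    print(task, y.shape)
```

Because both heads read the same `h`, gradients from every task loss would update `W_shared` during training, which is how related tasks act as an inductive bias for each other.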

misra2016cross argue that there is no principled way to split the architecture in hard parameter sharing, and conduct the first empirical study of the performance trade-offs among different sets of tasks and splitting schemes in CNNs. The dependence between the involved tasks and the way the architecture is split motivates the exploration of an architecture that can capture all possible splittings and thus learn an optimal combination of task-shared and task-specific representations, i.e., soft parameter sharing, shown in Fig. 9(b). While hard parameter sharing requires the shallow layers to be identical across tasks, soft parameter sharing lets each task maintain its own shallow layers and leverage features from related tasks during propagation to capture similarities. These feature propagation techniques include, but are not limited to, fusion, aggregation, and attention. However, whether employing hard or soft parameter sharing, exploring the MTL architecture space remains error-prone. First, the space of deep neural architectures grows exponentially with depth, and incorporating more tasks significantly expands the range of candidate solutions. Second, hard parameter sharing compresses the model size, which can lead to a sub-optimal solution, whereas soft parameter sharing safeguards performance by maintaining the maximum total model size, allowing each task to learn a task-specific architecture, in contrast to STL. A greedy adaptive architecture search during neural network training shows promise. In adaptive parameter sharing, shown in Fig. 9(c), every path between the layers of different tasks is active before training. Connections vanish in pursuit of model compression during multi-task optimization, and a thin network is usually finalized after this dynamic branching procedure.

Table 6. Summary of notations used in Sec. 2.2.
Notation | Description
$b, B$ | Batch size.
$lr$ | Learning rate.
$\mathcal{X}_l^t \in \mathbb{R}^{(B\times)H\times W\times C}$ | Feature maps output from the $l$-th layer of the $t$-th task, where $(B,)H,W,C$ are (batch size,) #height, #width, and #channel.
$\mathcal{W} \in \mathbb{R}^{S\times S\times C_{\text{in}}\times C_{\text{out}}}$ | Convolution filter, where $S$ denotes the size of the filter, and $C_{\text{in}}, C_{\text{out}}$ denote the number of input and output channels, respectively.
$\exp(\cdot)$ | Exponential function.
$\sigma(\cdot)$ | Sigmoid function, where $\sigma(x) = 1/(1+\exp(-x))$.
$\text{softmax}(\cdot)$ | Softmax function, where $[\text{softmax}(\boldsymbol{x})]_j = \exp(x_j)/\sum_i \exp(x_i)$ for any entry index $j$.
$\text{sim}(\cdot,\cdot)$ | An arbitrary similarity function, e.g. cosine similarity $\cos(\cdot,\cdot)$.
$\odot$ | The element-wise (Hadamard) product.
$LN(\cdot)$ | Layer norm.
$MHSA(q,k,v)$ | Multi-head self-attention operator.
$CONV_{\mathcal{W}}(\cdot)$ | Convolution operation parametrized by $\mathcal{W}$.
$RESHAPE(\cdot)$ | Reshape operation that rearranges the original feature maps in $\mathbb{R}^{H\times W\times C}$ into $\mathbb{R}^{HW\times C}$.

Unless explicitly stated otherwise, we employ the notation provided in Tab. 6 within the context of DL settings to expand upon and complement the information presented in Tab. 3.

Figure 10. Two of the earliest applications of hard-parameter sharing in CNNs: (a) the Tasks-Constrained Deep Convolutional Network (TCDCN), which jointly extracts common features from human faces for multiple tasks such as landmark detection, head pose estimation, and facial attribute inference; and (b) the Fast Region-based Convolutional Network method (Fast R-CNN), where each region of interest (RoI) is first projected into a fixed-size feature map and then mapped to a feature vector used for both object probability prediction and bounding-box offset regression.

2.2.1. Feature fusion

Feature fusion is a common technique in MTL that fuses features extracted under the supervision of different tasks, leveraging both shared and private knowledge across tasks. This technique allows each network to better exploit the relationships between tasks and thus improve overall performance. In general, feature fusion in MTL involves weighted summation, concatenation, or a combination of both. We categorize feature fusion methods into two classes: parallel sharing, where the fusion happens at the same layer position across tasks, and non-parallel sharing, in which the shared layers may be permuted. Representative works on parallel sharing include Cross-Stitch Networks (misra2016cross), Sluice Networks (ruder2019latent), and Neural Discriminative Dimensionality Reduction in Convolutional Neural Networks (NDDR-CNN) (gao2019nddr). As research in this direction progresses, an increasing number of learnable parameters are used to control the fusion process. For example, Cross-Stitch Networks utilize four task-aware parameters, Sluice Networks capture latent subspaces of features via extra parameters, and NDDR-CNN models layer-wise fusion using $1\times 1$ convolutions. However, expecting task feature hierarchies to align perfectly, even among closely related tasks, is unreasonable. Imposing parallel sharing on such unmatched layers could lead to negative transfer. To remedy this, Soft Order (meyerson2018beyond) uses a more flexible ordering of shared layers, assembling them in different ways for different tasks.

Parallel sharing. Cross-Stitch Networks (misra2016cross) is a soft parameter-sharing architecture that learns an optimal combination of task-shared and task-specific representations via a unit with four learnable parameters, named the cross-stitch unit. As shown in Fig. 8(a), the activations from different tasks are linearly combined via four parameters $(\alpha_{11},\alpha_{12},\alpha_{21},\alpha_{22})$. We denote by $\mathcal{X}_l^i$ the feature maps in the $l$-th layer of task $i$. The cross-stitch unit is then formalized as

(41) $\begin{bmatrix}\mathcal{X}_{l+1}^{1}\\ \mathcal{X}_{l+1}^{2}\end{bmatrix}=\begin{bmatrix}\alpha_{11}&\alpha_{12}\\ \alpha_{21}&\alpha_{22}\end{bmatrix}\begin{bmatrix}\mathcal{X}_{l}^{1}\\ \mathcal{X}_{l}^{2}\end{bmatrix}$

Specifically, the extreme setting $(\alpha_{11},\alpha_{12},\alpha_{21},\alpha_{22})=(1,0,0,1)$ makes the corresponding layers non-sharing. From this perspective, separate STL is a special case of cross-stitch combinations. By varying the $\alpha_{\cdot 1}$ and $\alpha_{\cdot 2}$ values, the unit can move between task-shared and task-specific representations, and even choose a middle ground if necessary.
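To make Eq. (41) concrete, the following NumPy sketch applies a cross-stitch unit to two toy activation maps. It is an illustration under simplifying assumptions rather than the authors' implementation: the $\alpha$ values are fixed by hand instead of being learned by backpropagation, and the function name `cross_stitch` and the tensor sizes are hypothetical.

```python
import numpy as np

def cross_stitch(x1, x2, alpha):
    """Eq. (41): linearly mix two same-shaped activation maps with a 2x2 matrix."""
    alpha = np.asarray(alpha, dtype=float)
    stacked = np.stack([x1, x2])                             # (2, H, W, C)
    mixed = np.tensordot(alpha, stacked, axes=([1], [0]))    # (2, H, W, C)
    return mixed[0], mixed[1]

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 4, 3))   # task-1 activations (toy H = W = 4, C = 3)
x2 = rng.normal(size=(4, 4, 3))   # task-2 activations

# alpha = (1, 0, 0, 1): no sharing, i.e. two separate STL networks.
y1, y2 = cross_stitch(x1, x2, [[1.0, 0.0], [0.0, 1.0]])
identity_ok = np.allclose(y1, x1) and np.allclose(y2, x2)

# Uniform alpha: both tasks receive the same averaged, fully shared features.
s1, s2 = cross_stitch(x1, x2, [[0.5, 0.5], [0.5, 0.5]])
shared_ok = np.allclose(s1, s2) and np.allclose(s1, 0.5 * (x1 + x2))
```

Intermediate values of $\alpha$ interpolate between these two extremes, which corresponds to the "middle ground" mentioned above.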

Sluice Networks (ruder2019latent) learn shared parameters between two BiLSTM-based sequence labeling networks (plank2016multilingual). This work aims to model loosely related tasks with non-overlapping datasets. Fig. 8(b) shows a sluice meta-network with two tasks, in which each layer is partitioned into two orthogonal subspaces $\boldsymbol{G}$ and $\boldsymbol{G}^{\perp}$. Accordingly, the activations in the $l$-th layer of task $i$ are also partitioned into $\mathcal{X}_l^{i}$ and $\mathcal{X}_l^{i^{\perp}}$, leading to a matrix in $\mathbb{R}^{4\times 4}$ that combines the activations from the two tasks:

(42) $\begin{bmatrix}\mathcal{X}_{l+1}^{1}\\ \mathcal{X}_{l+1}^{1^{\perp}}\\ \mathcal{X}_{l+1}^{2}\\ \mathcal{X}_{l+1}^{2^{\perp}}\end{bmatrix}=\begin{bmatrix}\alpha_{11}&\alpha_{11^{\perp}}&\alpha_{12}&\alpha_{12^{\perp}}\\ \alpha_{1^{\perp}1}&\alpha_{1^{\perp}1^{\perp}}&\alpha_{1^{\perp}2}&\alpha_{1^{\perp}2^{\perp}}\\ \alpha_{21}&\alpha_{21^{\perp}}&\alpha_{22}&\alpha_{22^{\perp}}\\ \alpha_{2^{\perp}1}&\alpha_{2^{\perp}1^{\perp}}&\alpha_{2^{\perp}2}&\alpha_{2^{\perp}2^{\perp}}\end{bmatrix}\begin{bmatrix}\mathcal{X}_{l}^{1}\\ \mathcal{X}_{l}^{1^{\perp}}\\ \mathcal{X}_{l}^{2}\\ \mathcal{X}_{l}^{2^{\perp}}\end{bmatrix}$

Inspired by Cross-Stitch Networks, these $\alpha$ values are learnable and control how much task-shared information to share and how much task-specific information to preserve. Finally, the $\beta$ parameters (see Fig. 8(b)) linearly combine, through skip connections, the multi-task representations at various levels of the network architecture.
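The subspace mixing of Eq. (42) and the $\beta$-weighted skip connections can be sketched in NumPy as follows. This is a toy illustration, not the Sluice Networks implementation: the $\alpha$ and $\beta$ values are fixed by hand, $\beta$ is assumed to hold one scalar per layer, and the names `sluice_mix` and `beta_combine` are hypothetical.

```python
import numpy as np

def sluice_mix(subspaces, alpha):
    """Eq. (42): mix [x1, x1_perp, x2, x2_perp] with a 4x4 matrix alpha."""
    stacked = np.stack(subspaces)                                   # (4, N, D)
    mixed = np.tensordot(np.asarray(alpha, dtype=float), stacked,
                         axes=([1], [0]))                           # (4, N, D)
    return list(mixed)                                              # 4 arrays (N, D)

def beta_combine(layer_outputs, beta):
    """Weighted sum over the outputs of different layers (skip connections)."""
    return sum(b * h for b, h in zip(beta, layer_outputs))

rng = np.random.default_rng(0)
x = [rng.normal(size=(5, 6)) for _ in range(4)]   # toy activations per subspace

# Block-diagonal alpha: subspaces of one task mix, but tasks do not interact.
alpha = np.array([[0.7, 0.3, 0.0, 0.0],
                  [0.3, 0.7, 0.0, 0.0],
                  [0.0, 0.0, 0.6, 0.4],
                  [0.0, 0.0, 0.4, 0.6]])
y = sluice_mix(x, alpha)

# Changing only the task-2 inputs leaves the task-1 outputs untouched.
x_changed = [x[0], x[1], x[2] + 1.0, x[3] - 1.0]
y_changed = sluice_mix(x_changed, alpha)
task1_isolated = np.allclose(y[0], y_changed[0]) and np.allclose(y[1], y_changed[1])

# beta-weighted summary of three toy layer outputs for one task.
layers = [rng.normal(size=(5, 6)) for _ in range(3)]
summary = beta_combine(layers, beta=[0.2, 0.3, 0.5])
```

Non-zero off-diagonal blocks in `alpha` would let information flow between the two tasks, which is the quantity the learnable $\alpha$ values control.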

Neural Discriminative Dimensionality Reduction in Convolutional Neural Networks (NDDR-CNN) (gao2019nddr) further concatenates feature maps from different tasks in a channel-wise manner. This NDDR, as shown in Fig. 8(c), can be implemented with a simple $1\times 1$ convolutional layer followed by a batch normalization layer, and can be added to any end-to-end trainable CNN in a “plug-and-play” fashion. With $T$ tasks, we denote the $1\times 1$ convolution by $\mathcal{W}\in\mathbb{R}^{1\times 1\times TC\times TC}$, where $TC$ is the depth of the combined feature maps from all tasks. We concatenate the feature maps along the channel dimension and split the $1\times 1$ convolution along the output dimension into $T$ task-specific parts as follows:

$\mathcal{X}_{l}=[\mathcal{X}_{l}^{1},\cdots,\mathcal{X}_{l}^{T}],\quad \mathcal{W}=[\mathcal{W}^{1},\cdots,\mathcal{W}^{T}],$

where $\mathcal{X}_{l}\in\mathbb{R}^{H\times W\times TC}$ and $\mathcal{W}^{t}\in\mathbb{R}^{1\times 1\times TC\times C}$. Then, the output feature maps at the $(l+1)$-th layer for the $t$-th task can be calculated as

(43) $\mathcal{X}_{l+1}^{t}=CONV_{\mathcal{W}^{t}}(\mathcal{X}_{l}),\quad t=1,\cdots,T.$

The NDDR layer defined by Eq. (43) is a standard $1\times 1$ convolution operation in CNNs. To avoid a trivial solution for $\mathcal{W}$ and noisy directions in the learned features, a batch normalization layer follows each NDDR layer and $\ell_{2}$ weight decay is applied to the weights of the NDDR layer, respectively.
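Since a $1\times 1$ convolution is a per-pixel linear map over channels, Eq. (43) can be sketched with a plain matrix product. The NumPy snippet below is a minimal illustration under stated assumptions: batch normalization and weight decay are omitted, the filter is stored as a `(T*C, T*C)` matrix, and the function name `nddr_layer` and the toy sizes are hypothetical.

```python
import numpy as np

def nddr_layer(features, weight):
    """Eq. (43) without batch norm: features is a list of T maps of shape
    (H, W, C); weight is a (T*C, T*C) matrix standing in for the 1x1 filter."""
    T, C = len(features), features[0].shape[-1]
    x = np.concatenate(features, axis=-1)          # X_l with shape (H, W, T*C)
    outputs = []
    for t in range(T):
        w_t = weight[:, t * C:(t + 1) * C]         # W^t with shape (T*C, C)
        outputs.append(x @ w_t)                    # per-pixel matmul == 1x1 conv
    return outputs

rng = np.random.default_rng(0)
T, H, W, C = 2, 4, 4, 3
feats = [rng.normal(size=(H, W, C)) for _ in range(T)]

# Identity weights: every task keeps exactly its own channels (no sharing).
same = nddr_layer(feats, np.eye(T * C))
no_sharing = all(np.allclose(o, f) for o, f in zip(same, feats))

# Random weights: every output channel mixes channels from both tasks.
mixed = nddr_layer(feats, rng.normal(size=(T * C, T * C)))
```

The identity case shows that separate single-task networks are again recovered as a special case, while any off-diagonal weight introduces cross-task fusion.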

Non-parallel sharing. Soft Order (meyerson2018beyond) learns how shared layers are assembled in permuted ways for different tasks. Specifically, a learnable tensor of scalars $S\in\mathbb{R}^{L\times L\times T}$ is used to implement the soft ordering, where $L$ is the number of layers and $T$ is the number of tasks. For simplicity, consider a hard sharing network with $L$ shared layers $\{f_{\mathcal{W}_{l}}\}_{l=1}^{L}$ ($f$ can be a $CONV$ or linear function); then the soft ordering of this hard sharing network for the $t$-th task is:

(44) 𝒳lt=j=1Lst,j,lf𝒲j(𝒳l1t),l=1,,L,t=1,,T,s.t.j=1Lst,j,l=1with(t,l),formulae-sequencesuperscriptsubscript𝒳𝑙𝑡superscriptsubscript𝑗1𝐿subscript𝑠𝑡𝑗𝑙subscript𝑓subscript𝒲𝑗superscriptsubscript𝒳𝑙1𝑡formulae-sequence𝑙1𝐿formulae-sequence𝑡1𝑇s.t.superscriptsubscript𝑗1𝐿subscript𝑠𝑡𝑗𝑙1withfor-all𝑡𝑙{\mathcal{X}}_{l}^{t}=\sum\nolimits_{j=1}^{L}s_{t,j,l}f_{{\mathcal{W}}_{j}}({% \mathcal{X}}_{l-1}^{t}),l=1,\cdots,L,t=1,\cdots,T,\quad\text{s.t.}\sum% \nolimits_{j=1}^{L}s_{t,j,l}=1~{}\text{with}~{}\forall(t,l),caligraphic_X start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_t , italic_j , italic_l end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT caligraphic_W start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( caligraphic_X start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) , italic_l = 1 , ⋯ , italic_L , italic_t = 1 , ⋯ , italic_T , s.t. ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_t , italic_j , italic_l end_POSTSUBSCRIPT = 1 with ∀ ( italic_t , italic_l ) ,

where st,j,lsubscript𝑠𝑡𝑗𝑙s_{t,j,l}italic_s start_POSTSUBSCRIPT italic_t , italic_j , italic_l end_POSTSUBSCRIPT is the (t,j,l)𝑡𝑗𝑙(t,j,l)( italic_t , italic_j , italic_l )-th entry of the tensor S𝑆Sitalic_S. Fig. 7(d) visualizes this layer permutation operation. It is noticed that the constraint on st,j,l=1subscript𝑠𝑡𝑗𝑙1s_{t,j,l}=1italic_s start_POSTSUBSCRIPT italic_t , italic_j , italic_l end_POSTSUBSCRIPT = 1 for (t,l)for-all𝑡𝑙\forall(t,l)∀ ( italic_t , italic_l ) can be easily implemented via a softmax function. In practice, a dropout operation is beneficial to increasing the generalization capacity of shared representations.
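The soft ordering in Eq. (44) can be sketched as follows, assuming toy vector-valued layers and a per-task table of unconstrained logits that a softmax turns into the weights $s_{t,j,l}$. The function names and the toy layers are hypothetical and dropout is left out; this is an illustration rather than the reference implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_order_forward(x, layers, logits_t):
    """Run one task through L shared layers with soft ordering.

    layers:   list of L callables f_{W_j}, each mapping a vector to a vector.
    logits_t: L x L list for this task; logits_t[l][j] scores layer j at depth l.
    """
    for depth_logits in logits_t:
        s = softmax(depth_logits)         # enforces sum_j s_{t,j,l} = 1
        outputs = [f(x) for f in layers]  # every shared layer sees the same input
        x = [sum(s[j] * outputs[j][i] for j in range(len(layers))) for i in range(len(x))]
    return x


# Two toy shared layers acting on vectors.
layers = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]

# Strongly peaked logits approximate a hard ordering:
# task A applies layer 0 then layer 1, task B applies them in reverse.
logits_a = [[50.0, -50.0], [-50.0, 50.0]]
logits_b = [[-50.0, 50.0], [50.0, -50.0]]
out_a = soft_order_forward([1.0], layers, logits_a)  # close to 2 * 1 + 1
out_b = soft_order_forward([1.0], layers, logits_b)  # close to 2 * (1 + 1)
out_uniform = soft_order_forward([1.0], layers, [[0.0, 0.0], [0.0, 0.0]])
```

With uniform logits every depth averages all shared layers, so the learned logits are what let each task settle on its own ordering of the same parameters.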

Remarks (i) Feature fusion enables the exploration of multi-task interactions in a "plug and play" manner, making it a general-purpose MTL solution that can be applied to any backbone. (ii) The feature-level relationships between tasks can be investigated by examining the introduced learnable parameters after training. (iii) Feature fusion cannot reveal what information is propagated during the multitasking process, highlighting the need for design guidelines that go beyond common practices. (iv) Feature fusion inherently imposes constraints on the SIMO setting, as it allows features to be fused only within the same context. (v) Feature fusion creates task-specific branches that also need to learn shared features across tasks, which can hinder task-awareness compared to STL, which focuses on capturing representations specific to the target task.

Figure 11. The taxonomy of cascading structures into four categories: (a) the vanilla cascading structure, (b) the cascading structure with prediction shortcuts, (c) the cascading structure with prediction and feature shortcuts, and (d) the cascading structure with prediction and residual shortcuts.

2.2.2. Cascading

Because supervising all tasks at the outermost level has been shown to be sub-optimal, another avenue for relaxing this parallel sharing is multi-task cascaded learning (sogaard-goldberg-2016-deep). This line of work supervises tasks at different layers according to their level, so that higher-level tasks can effectively leverage the shared representation derived from lower-level tasks. In practice, multi-task cascading can be applied to 1) a complicated task that can be decomposed into several sub-tasks, e.g., instance-aware semantic segmentation decomposed into differentiating instances, estimating masks, and categorizing objects in CV (dai2016instance), and 2) a group of hierarchical tasks, e.g., part-of-speech (POS) tagging (word-level), dependency parsing (syntactic-level), and question answering (QA) (semantic-level) in NLP (sogaard-goldberg-2016-deep; hashimoto-jmt:2017:EMNLP2017). In this line of research, early work (sogaard-goldberg-2016-deep) realizes cascading by supervising low-level tasks at shallow layers and then reusing the representations from those shallow layers for higher-level tasks. The Joint Many-Task (JMT) model (hashimoto-jmt:2017:EMNLP2017) adds shortcut connections from each lower-level task prediction to higher-level tasks, which further reflects task hierarchies. Furthermore, shortcut connections in Multi-task Network Cascades (MNCs) (dai2016instance) and Deep Cascade Multi-Task Learning (DCMTL) (gong2019deep) come from both cascade connections (predictions) and residual connections (features). Hierarchical MTL (HMTL) (sanh2019hierarchical) introduces more semantic tasks that share both common embeddings and encoders in a hierarchical cascading architecture.

Vanilla Cascading (sogaard-goldberg-2016-deep) is the first to present a multi-task learning architecture based on bi-directional RNNs that supervises different tasks at different layers, as shown in Fig. 10(a). In this study, the POS task is supervised at the innermost layer, while syntactic chunking and Combinatory Categorial Grammar (CCG) supertagging join at the outermost layer to utilize the shared representation of the lower-level task via hard parameter sharing. In this case, the lower-level task supervision affects the parameter updates of the shallow layers, which is beneficial to all tasks involved in MTL.
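To make the idea of supervising tasks at different depths concrete, here is a toy sketch in which a shared stack is run once and each task head reads the hidden state of its assigned layer. The layers and heads are hypothetical stand-ins (plain arithmetic rather than bi-directional RNNs), so the snippet only illustrates the wiring, not the original model.

```python
def cascade_forward(x, layers, heads, supervised_depth):
    """Run the shared stack once and attach each task head at its own depth.

    layers:           list of callables forming the shared stack (shallow to deep).
    heads:            dict mapping a task name to a callable producing its prediction.
    supervised_depth: dict mapping a task name to the index of the layer feeding its head.
    """
    hidden = []
    for layer in layers:
        x = layer(x)
        hidden.append(x)
    return {task: heads[task](hidden[depth]) for task, depth in supervised_depth.items()}


# Toy stand-ins for two stacked layers and three task heads (all hypothetical).
layers = [lambda v: [a + 1.0 for a in v], lambda v: [2.0 * a for a in v]]
heads = {"pos": sum, "chunking": max, "ccg": min}

# POS is supervised at the inner layer, chunking and CCG supertagging at the outer layer.
supervised_depth = {"pos": 0, "chunking": 1, "ccg": 1}
preds = cascade_forward([1.0, 2.0], layers, heads, supervised_depth)
```

Because the POS head sits on the first layer, its loss would only update that layer, while the outer-layer tasks would backpropagate through both layers.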

Multi-task Network Cascades (MNCs) (dai2016instance) perform the three sub-tasks of instance-aware semantic segmentation at different stages and reuse the features of these tasks at different layers. The three stages predict box-level instance proposals, mask-level instance regression, and instance categorization, respectively, and each later stage relies on the output of the previous one. As shown in Fig. 10(b), the innermost features are utilized by all sub-tasks, which benefits both accuracy and speed under end-to-end training.

Joint Many-Task (JMT) Model (hashimoto-jmt:2017:EMNLP2017) is another cascading model to predict NLP tasks with different linguistic levels of morphology, syntax, and semantics. JMT shares a similar architecture with MNCs, as shown in Fig. 10(c), but each higher-level task contains the shortcut connections from the predictions of all lower-level tasks. In addition, the naïve $\ell_{2}$ regularization term is imposed on model weights to allow the improvement of one task without exhibiting catastrophic interference with the other tasks.

Deep Cascade Multi-Task Learning (DCMTL) (gong2019deep) is the first to incorporate both cascade and residual connections. As shown in Fig. 10(d), the cascade connections transmit predictions from lower tasks, while the residual connections transmit inputs from lower layers. These skip connections have been validated as effective for strictly ordered tasks. The cascading structure alone proves inadequate for high-level tasks that heavily rely on low-level tasks. In addition, DCMTL can outperform previous SOTA methods and has been deployed in the online shopping assistant of a dominant Chinese E-commerce platform.

Hierarchical Multi-Task Learning (HMTL) (sanh2019hierarchical) is a parallel method trained in a hierarchical fashion. The model supervises a set of low-level tasks at the bottom layers and more complex tasks at the top layers. Similar to MNCs (dai2016instance), representations extracted at the very beginning are fed into all the successive encoders for different tasks, which benefits training stability and speed. As also shown in Fig. 10(d), HMTL is a variant in which parallel high-level tasks can coexist, e.g., Coreference Resolution (CR) and Relation Extraction (RE), and in which more types of word representations, such as pre-trained GloVe (pennington2014glove) and ELMo (peters-etal-2018-deep) embeddings, are combined to achieve the best performance.

Remarks (i) Cascading facilitates feature communication across different layers. (ii) Cascading enhances the utility of features for tasks at different levels.

2.2.3. Knowledge Distillation (KD)

Motivated by KD (44873), where a teacher model guides a student model by passing meaningful knowledge (e.g., soft labels), separate models in MTL for different tasks can utilize definite information. Specifically, a teacher model can be trained on multiple tasks of interest and then serve as an expert that performs those tasks and possesses versatile knowledge. The knowledge from the teacher model is then transferred to a student model. This can be done by training the student model to mimic the behavior of the teacher model, e.g., the student model learns to predict the outputs or pattern structures of the teacher model on the shared tasks. On the other hand, the student model can be trained jointly on multiple tasks, using both the labeled data for each task and the guidance from the teacher model. The shared information and generalizable representations learned from the teacher model can benefit the student model's performance on all the tasks. In this manner, the teacher model performs auxiliary tasks to assist the student model in target tasks. For example, the depth prediction from a customized CNN can help the segmentation task via multi-modal distillation (i.e., training with RGB-Depth data instead of RGB data), where depth prediction is an intermediate auxiliary task for the target segmentation task (xu2018pad). The research in this subfield can be classified into two categories that correspond to the knowledge encompassed within a teacher model: feature-level and response-level. KD4MTL (li2020knowledge) carries forward FitNets (romero2014fitnets) by optimizing the distance between the features of the offline task-specific networks and the online multi-task network. MuST (ghiasi2021multi) and OKD-MTL (jacob2023online) distill the knowledge (i.e., pseudo labels) from pre-trained specialized teachers to general-purpose students. MuST (ghiasi2021multi) pretrains several specialized teachers capable of generating multi-task labels for the target dataset. CrossDistil (yang2022cross) distills the responses of item preference across different tasks in the recommender system.

Feature-Level. Knowledge Distillation for Multi-task Learning (KD4MTL) (li2020knowledge), as shown in Fig. 7(e), first trains an offline task-specific network for each task, and then learns the multi-task network by adding a loss that minimizes the distance between the features of each task-specific network and those of the multi-task network. Because the multi-task network must serve multiple tasks while each task-specific network specializes in its own task, the two output features cannot be completely matched. Instead, the feature map from the multi-task network, denoted by $\mathcal{X}\in\mathbb{R}^{H\times W\times C}$, is transformed via an adaptor $\phi^{(t)}:\mathbb{R}^{H\times W\times C}\xrightarrow{1\times 1\times C\times C~\mathrm{CONV}}\mathbb{R}^{H\times W\times C},\ t=1,\cdots,T$. These adaptors are jointly learned with the multi-task network via the loss function defined as

(45) $\mathcal{L}^{d}=\sum\nolimits_{t=1}^{T}\ell^{d}(\phi^{(t)}(\mathcal{X}),\tilde{\mathcal{X}}^{(t)}),$

where $\tilde{\mathcal{X}}^{(t)}$ is the feature map from the offline single-task network corresponding to task $t$, and $\ell^{d}$ is defined as the Euclidean distance between the two feature maps after $\ell_{2}$ normalization.
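A minimal sketch of the distillation term in Eq. (45) follows, assuming feature maps are given as lists of per-pixel channel vectors, each adaptor is a $C\times C$ channel-mixing matrix (a $1\times 1$ convolution), and $\ell^{d}$ is the plain Euclidean distance between the flattened, $\ell_{2}$-normalized maps. Whether the distance is squared and how normalization is applied are assumptions here, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit l2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def adaptor(pixels, weight):
    """1x1 convolution with a C x C weight: a linear map over each pixel's channels."""
    channels = len(weight[0])
    return [[sum(p[k] * weight[k][c] for k in range(len(p))) for c in range(channels)] for p in pixels]


def feature_distance(a_pixels, b_pixels):
    """Euclidean distance between two flattened, l2-normalized feature maps."""
    a = l2_normalize([v for p in a_pixels for v in p])
    b = l2_normalize([v for p in b_pixels for v in p])
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kd4mtl_loss(shared_pixels, adaptor_weights, teacher_pixels):
    """Sum over tasks of the distance between adapted shared features and teacher features."""
    return sum(
        feature_distance(adaptor(shared_pixels, w_t), x_t)
        for w_t, x_t in zip(adaptor_weights, teacher_pixels)
    )


shared = [[1.0, 0.0], [0.0, 1.0]]        # 2 pixels, C = 2
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
teachers = [
    [[2.0, 0.0], [0.0, 2.0]],            # task 1 teacher: scaled copy of the shared map
    [[0.0, 3.0], [3.0, 0.0]],            # task 2 teacher: channel-swapped, scaled copy
]
aligned = kd4mtl_loss(shared, [identity, swap], teachers)
misaligned = kd4mtl_loss(shared, [identity], [teachers[1]])
```

The example shows why adaptors help: one shared feature map can match two differently oriented teachers once each task has its own linear transform, whereas a single untransformed map cannot.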

Online KD for MTL (OKD-MTL) (jacob2023online) proposes an online knowledge distillation method to mitigate negative transfer across tasks. The adaptive feature distillation (AFD) loss with an online task weighting (OTW) scheme is designed to selectively train layers for each task. As shown in Fig. 7(g), the critical component AFD is an online weighted knowledge distillation performed on intermediate features from the shared ViT backbone of MTL, and the distilled features are from the teacher model that performs STL on each task. We denote by $L$ the total number of layers of the ViT encoder backbone and let $T$ denote the number of tasks. Then the AFD loss is defined as

(46) $\mathcal{L}_{\text{AFD}}=\sum\nolimits_{l=1}^{L}\left\|\bar{\mathcal{X}_{l}}-\sum\nolimits_{t=1}^{T}w^{t}_{l}\mathcal{X}_{l}^{t}\right\|^{2}_{2},$

where $w^{t}_{l}$ denotes the learnable parameters for the $t$-th task in the $l$-th layer, which balance the multiple tasks, and $\bar{\mathcal{X}_{l}}$ denotes the shared features learned from the teacher model at the $l$-th layer. The shared features can be distilled for each task's features $\mathcal{X}^{t}_{l}$ through Eq. (46) above. In the framework of OKD-MTL, the STL teacher and MTL students are trained in an end-to-end manner through the total loss

(47) $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{AFD}}+\sum\nolimits_{t=1}^{T}(\mathcal{L}_{\text{STL}}^{t}+\lambda_{t}\mathcal{L}_{\text{MTL}}^{t}).$
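Eqs. (46) and (47) can be illustrated with flattened feature vectors, under the assumption that each $w^{t}_{l}$ is a single scalar per task and layer. The function names and the numbers below are hypothetical and only show how the terms combine.

```python
def afd_loss(shared_feats, task_feats, weights):
    """Sum over layers of || X_bar_l - sum_t w_l^t X_l^t ||_2^2.

    shared_feats[l]  : flattened shared feature vector at layer l
    task_feats[l][t] : flattened feature vector of task t at layer l
    weights[l][t]    : scalar weight w_l^t (assumed scalar in this sketch)
    """
    loss = 0.0
    for x_bar, x_tasks, w_l in zip(shared_feats, task_feats, weights):
        mix = [sum(w * x_t[i] for w, x_t in zip(w_l, x_tasks)) for i in range(len(x_bar))]
        loss += sum((a - b) ** 2 for a, b in zip(x_bar, mix))
    return loss


def okd_total_loss(afd, stl_losses, mtl_losses, lambdas):
    """L_AFD + sum_t (L_STL^t + lambda_t * L_MTL^t)."""
    return afd + sum(s + lam * m for s, m, lam in zip(stl_losses, mtl_losses, lambdas))


# One layer (L = 1), two tasks (T = 2), two feature dimensions.
shared_feats = [[1.0, 2.0]]
task_feats = [[[1.0, 0.0], [0.0, 2.0]]]
perfect = afd_loss(shared_feats, task_feats, [[1.0, 1.0]])  # mixture reproduces the shared features
partial = afd_loss(shared_feats, task_feats, [[0.5, 0.5]])  # residual is (0.5, 1.0)
total = okd_total_loss(partial, stl_losses=[1.0, 2.0], mtl_losses=[2.0, 4.0], lambdas=[0.5, 0.25])
```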

To mitigate the gap between the MTL and STL losses, OTW adjusts the task weight $\lambda_{t}$ for the $t$-th task at iteration $i$ as follows:

(48) $\lambda_{t}(i)=T\frac{\exp(r^{t}(i)/k)}{\sum_{s=1}^{T}\exp(r^{s}(i)/k)},\quad r^{t}(i)=\frac{\mathcal{L}_{\text{MTL}}^{t}(i)}{\mathcal{L}_{\text{STL}}^{t}(i)},\quad t=1,\cdots,T,$

where $k$ serves as the temperature hyperparameter to control this task weighting process, and $i$ represents the iteration index.
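The weighting rule in Eq. (48) is a temperature-scaled softmax over the MTL-to-STL loss ratios, rescaled so that the weights sum to $T$. A small sketch, with a hypothetical function name and made-up loss values:

```python
import math


def otw_weights(mtl_losses, stl_losses, temperature):
    """Task weights: T * softmax(r / k), where r^t = L_MTL^t / L_STL^t."""
    ratios = [m / s for m, s in zip(mtl_losses, stl_losses)]
    scaled = [r / temperature for r in ratios]
    offset = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - offset) for v in scaled]
    total = sum(exps)
    num_tasks = len(ratios)
    return [num_tasks * e / total for e in exps]


# Task 2 lags further behind its single-task teacher, so it receives the larger weight.
lagging = otw_weights(mtl_losses=[1.0, 2.0], stl_losses=[1.0, 1.0], temperature=1.0)

# Equal ratios give every task a weight of 1.
balanced = otw_weights(mtl_losses=[0.5, 1.0, 2.0], stl_losses=[0.5, 1.0, 2.0], temperature=2.0)
```

A larger temperature $k$ flattens the weights toward 1, while a smaller one concentrates the weight on the task whose MTL loss is furthest from its STL loss.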

Response-Level. Multi-Task Self-Training (MuST) (ghiasi2021multi) first trains the classification, detection, and segmentation teacher models from scratch (pre-trained checkpoints are also recommended to alleviate computational burdens) on ImageNet (deng2009imagenet; russakovsky2015imagenet)/JFT-300M (sun2017revisiting), Objects365 (shao2019objects365), and COCO (kirillov2019panoptic), respectively. The knowledge is then transferred from these specialized teachers to a general-purpose student model via pseudo-labeling. Fig. 7(f) gives an overview of MuST: every image in the shared dataset has supervision for all tasks, from either supervised or pseudo labels. Balancing these loss functions is tricky (see § 2.2.5), and MuST adopts $w_{i}=b^{s}{lr}_{i}^{t}/(b^{t}{lr}^{s})$ (goyal2017accurate) for ImageNet experiments, where $b$ denotes the batch size, $lr$ denotes the learning rate, the superscript indicates the student or teacher, and the total loss of MTL is defined as $\mathcal{L}_{\text{total}}=\sum_{i}w_{i}\mathcal{L}_{i}$. For JFT-300M, the algorithm in kendall2018multi was used to learn $w_{i}$ for each task. For the depth loss, the weight $w_{i}$ was chosen by a parameter sweep. It has been validated that MuST can both rival supervised STL and enhance transfer learning performance.
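As a small worked example of the weighting rule $w_{i}=b^{s}{lr}_{i}^{t}/(b^{t}{lr}^{s})$ and the resulting total loss, consider the sketch below. The batch sizes, learning rates, and loss values are made-up numbers chosen for easy arithmetic, not settings reported for MuST, and the function names are hypothetical.

```python
def must_task_weight(student_batch, student_lr, teacher_batch, teacher_lr):
    """w_i = (b^s * lr_i^t) / (b^t * lr^s) for one task."""
    return (student_batch * teacher_lr) / (teacher_batch * student_lr)


def weighted_total_loss(losses, weights):
    """L_total = sum_i w_i * L_i."""
    return sum(w * loss for w, loss in zip(weights, losses))


# Hypothetical settings: the student uses batch size 512 and learning rate 0.5.
w_cls = must_task_weight(student_batch=512, student_lr=0.5, teacher_batch=256, teacher_lr=0.25)
w_det = must_task_weight(student_batch=512, student_lr=0.5, teacher_batch=128, teacher_lr=0.25)
total = weighted_total_loss(losses=[1.0, 0.5], weights=[w_cls, w_det])
```

Here the detection teacher was trained with a smaller batch at the same learning rate, so its loss receives twice the weight of the classification loss.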

CrossDistil (Cross-Task Knowledge Distillation) (yang2022cross) proposes a recommender framework that can transfer fine-grained ranking knowledge about users' preferences towards items, as shown in Fig. 7(h). To facilitate fine-grained ranking, the training samples are divided into multiple subsets, taking into account all possible combinations of the tasks. For instance, in a recommender system where two tasks involve predicting “Buy” and “Like” for an item, the potential task combinations include “Buy:1, Like:1”, “Buy:1, Like:0”, “Buy:0, Like:1”, and “Buy:0, Like:0”. For simplicity, the division into subsets for two tasks is:

(49) $\begin{cases}
\mathcal{D}^{++}=\{(\boldsymbol{x}_{i},y_{i}^{(1)},y_{i}^{(2)})\in\mathcal{D}\mid y_{i}^{(1)}=1,y_{i}^{(2)}=1\},\ \mathcal{D}^{+-}=\{(\boldsymbol{x}_{i},y_{i}^{(1)},y_{i}^{(2)})\in\mathcal{D}\mid y_{i}^{(1)}=1,y_{i}^{(2)}=0\},\\
\mathcal{D}^{-+}=\{(\boldsymbol{x}_{i},y_{i}^{(1)},y_{i}^{(2)})\in\mathcal{D}\mid y_{i}^{(1)}=0,y_{i}^{(2)}=1\},\ \mathcal{D}^{--}=\{(\boldsymbol{x}_{i},y_{i}^{(1)},y_{i}^{(2)})\in\mathcal{D}\mid y_{i}^{(1)}=0,y_{i}^{(2)}=0\},\\
\mathcal{D}^{+\cdot}=\mathcal{D}^{++}\cup\mathcal{D}^{+-},\ \mathcal{D}^{-\cdot}=\mathcal{D}^{-+}\cup\mathcal{D}^{--},\ \mathcal{D}^{\cdot+}=\mathcal{D}^{++}\cup\mathcal{D}^{-+},\ \mathcal{D}^{\cdot-}=\mathcal{D}^{+-}\cup\mathcal{D}^{--},
\end{cases}$

where $\boldsymbol{x}$ represents the input feature vector from the whole dataset $\mathcal{D}$.
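The partition in Eq. (49) can be sketched directly for two binary tasks. The item identifiers and the string keys used for the subsets are illustrative choices.

```python
def partition_by_labels(dataset):
    """Split (x, y1, y2) samples into the four label combinations and their unions."""
    subsets = {"++": [], "+-": [], "-+": [], "--": []}
    for x, y1, y2 in dataset:
        key = ("+" if y1 == 1 else "-") + ("+" if y2 == 1 else "-")
        subsets[key].append(x)
    # Unions that ignore one task ("." marks the ignored position).
    subsets["+."] = subsets["++"] + subsets["+-"]
    subsets["-."] = subsets["-+"] + subsets["--"]
    subsets[".+"] = subsets["++"] + subsets["-+"]
    subsets[".-"] = subsets["+-"] + subsets["--"]
    return subsets


# Task 1 is "Buy" and task 2 is "Like"; the item names are placeholders.
dataset = [
    ("item_a", 1, 1),
    ("item_b", 1, 0),
    ("item_c", 0, 1),
    ("item_d", 0, 0),
    ("item_e", 1, 0),
]
subsets = partition_by_labels(dataset)
```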

We denote a sample from $\mathcal{D}^{++}$ by $\boldsymbol{x}^{++}\in\mathcal{D}^{++}$, and so forth. The fine-grained ranking considers the corresponding multipartite order $\boldsymbol{x}^{++}\succ\boldsymbol{x}^{+-}\succ\boldsymbol{x}^{-+}\succ\boldsymbol{x}^{--}$ instead of bipartite orders, e.g., $\boldsymbol{x}^{+\cdot}\succ\boldsymbol{x}^{-\cdot}$ or $\boldsymbol{x}^{\cdot+}\succ\boldsymbol{x}^{\cdot-}$, which may be contradictory among different tasks. Based on the fine-grained ranking, an augmented loss is introduced for each task as

(50) $\mathcal{L}_{\text{aug}}=-\sum\nolimits_{(\boldsymbol{x}^{++},\boldsymbol{x}^{+-},\boldsymbol{x}^{-+},\boldsymbol{x}^{--})}[\beta_{1}\ln\sigma(\hat{r}_{++\succ+-})+\beta_{2}\ln\sigma(\hat{r}_{-+\succ--})]-\sum\nolimits_{(\boldsymbol{x}^{+\cdot},\boldsymbol{x}^{-\cdot})}\ln\sigma(\hat{r}_{+\cdot\succ-\cdot}),$

where $\beta_{1}$ and $\beta_{2}$ are two hyper-parameters to balance the importance of pair-wise ranking relations and $\hat{r}$ is the logit value before the sigmoid function $\sigma$. Additionally, $\hat{r}_{++\succ--}=\hat{r}_{++}-\hat{r}_{--}$ and so forth. In contrast, the original regression-based loss function for each task is

(51) $\mathcal{L}_{\text{CE}}=-\sum\nolimits_{\boldsymbol{x}_{i}\in\mathcal{D}}[y_{i}\ln\sigma(\hat{r_{i}})+(1-y_{i})\ln(1-\sigma(\hat{r_{i}}))].$
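To contrast the two objectives, the sketch below evaluates Eq. (50) on explicitly supplied tuples of logits and Eq. (51) on labelled logits. How the quadruples and bipartite pairs are sampled is not specified here, so passing them in as lists is an assumption of this sketch, as are the function names.

```python
import math


def log_sigmoid(z):
    """Numerically stable ln(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def augmented_loss(quadruples, bipartite_pairs, beta1, beta2):
    """Ranking loss over (r_++, r_+-, r_-+, r_--) logits and (r_+., r_-.) logit pairs."""
    fine = sum(
        beta1 * log_sigmoid(r_pp - r_pm) + beta2 * log_sigmoid(r_mp - r_mm)
        for r_pp, r_pm, r_mp, r_mm in quadruples
    )
    coarse = sum(log_sigmoid(r_pos - r_neg) for r_pos, r_neg in bipartite_pairs)
    return -fine - coarse


def cross_entropy_loss(labels, logits):
    """Point-wise loss; ln(1 - sigmoid(r)) is computed as log_sigmoid(-r)."""
    return -sum(y * log_sigmoid(r) + (1 - y) * log_sigmoid(-r) for y, r in zip(labels, logits))


# All-equal logits carry no ranking information.
uninformed = augmented_loss([(0.0, 0.0, 0.0, 0.0)], [(0.0, 0.0)], beta1=0.5, beta2=0.5)

# Logits that respect the multipartite order ++ > +- > -+ > -- give a lower loss.
ordered = augmented_loss([(3.0, 2.0, 1.0, 0.0)], [(3.0, 1.0)], beta1=0.5, beta2=0.5)

ce = cross_entropy_loss(labels=[1, 0], logits=[0.0, 0.0])
```

The augmented loss depends only on differences between logits, which is what makes it a ranking objective, whereas the cross-entropy loss scores each sample against its own label.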

Based on Eqs. (50) and (51), CrossDistil regards the models trained with the augmented loss as teachers and the models trained with the regression-based loss as students. The distillation loss for each task is

(52) $\mathcal{L}_{\text{KD}}=-\sum\nolimits_{\boldsymbol{x}_{i}\in\mathcal{D}}[\sigma(\tilde{r_{i}}/\tau)\ln\sigma(\hat{r_{i}}/\tau)+(1-\sigma(\tilde{r_{i}}/\tau))\ln(1-\sigma(\hat{r_{i}}/\tau))],$

where $\tau$ is the distillation temperature, $\tilde{r_{i}}$ is learned and calibrated from Eq. (50), and an error correction mechanism is applied to ensure its alignment with the hard label $y_{i}$. The original regression loss and the knowledge distillation loss jointly contribute to the learning of the students for multiple tasks as

(53) $$\mathcal{L}_{\text{MT}}=\sum\nolimits_{t=1}^{T}[(1-\alpha^{(t)})\mathcal{L}_{\text{CE}}^{(t)}+\alpha^{(t)}\mathcal{L}_{\text{KD}}^{(t)}],$$

where $\alpha^{(t)}\in[0,1]$, $t=1,\cdots,T$, is a hyper-parameter to balance the two loss functions. In this manner, by distilling the fine-grained ranking of task combinations, cross-task knowledge is effectively transferred.
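As an illustration of how Eqs. (51)-(53) fit together, the following minimal sketch computes the student objective for a handful of samples. It assumes the teacher logits $\tilde{r_{i}}$ are already available (training the teachers with the augmented loss and the error correction step are not modeled), and the function names `bce_hard`, `kd_soft`, and `multitask_student_loss` are hypothetical, chosen only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_hard(student_logits, labels):
    """Eq. (51): cross-entropy between hard labels y_i and sigma(r_hat_i)."""
    return -sum(
        y * math.log(sigmoid(r)) + (1 - y) * math.log(1 - sigmoid(r))
        for r, y in zip(student_logits, labels)
    )

def kd_soft(student_logits, teacher_logits, tau):
    """Eq. (52): cross-entropy against the teacher's temperature-softened output."""
    loss = 0.0
    for r_hat, r_tilde in zip(student_logits, teacher_logits):
        p_teacher = sigmoid(r_tilde / tau)
        p_student = sigmoid(r_hat / tau)
        loss -= p_teacher * math.log(p_student) + (1 - p_teacher) * math.log(1 - p_student)
    return loss

def multitask_student_loss(per_task, alphas, tau):
    """Eq. (53): per-task convex combination of the two losses, summed over tasks.

    per_task: list of (student_logits, teacher_logits, labels), one entry per task.
    alphas:   list of alpha^(t) values in [0, 1], one per task.
    """
    total = 0.0
    for (student, teacher, labels), alpha in zip(per_task, alphas):
        total += (1 - alpha) * bce_hard(student, labels) + alpha * kd_soft(student, teacher, tau)
    return total
```

Setting `alpha` to 0 for a task recovers plain supervised training on that task, while setting it to 1 trains the student on the teacher's soft targets only.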

2.2.4. Cross-Task Attention

The attention mechanism (niu2021review; brauwers2021general; guo2022attention) has been one of the most crucial concepts in RNNs, CNNs, and Transformers over the past decade of DL. Generally, attention is an information aggregation technique inspired by the human recognition system, which tends to prioritize some local regions over others when processing rich information. Under MTL settings, features from different tasks are more abundant than in STL, which makes the attention mechanism a natural fit. Cross-task attention (bruggemann2021exploring), which encodes task-aware features into cross-task queries, performs task association by refining multi-source features. Unlike feature fusion methods (misra2016cross; ruder2019latent; gao2019nddr) that propagate task-shared information among different task-specific branches, cross-task attention calculates what and how to share based on a cross-task comparison between the source tasks and the target task. From a "morphological" perspective, the hard compartmentalization caused by a block-structured communication matrix in feature fusion methods can leave feature interference between tasks in place in some cases. This dilemma can be alleviated with a soft, learnable form of task-aware feature attention. Early works (xu2018pad; liu2019end; zhang2019pattern; zhou2020pattern; bruggemann2021exploring) build naïve attention modules (e.g., a sigmoid function or an inner product) to refine feature affinity or capture relational contexts across tasks, and then locate or diffuse features according to the attention map. PAD-Net (xu2018pad) and MTAN (liu2019end) select attentive features via an attention mask produced by a sigmoid activation. PAP (zhang2019pattern) and PSD (zhou2020pattern) iteratively diffuse features based on a cross-task affinity matrix. MTI-Net (vandenhende2020mti) first considers task interactions at multiple scales, using both a sigmoid function and a squeeze-and-excitation block (hu2018squeeze).

Transformer-based works exploit long-range dependencies using self-attention mechanisms.

Remarks (i) Cross-task attention allows the model to focus on features that are more relevant to each specific task. This targeted attention helps in better feature extraction and can lead to improved task-specific performance, especially when tasks are related but not identical. (ii) The attention mechanism can adaptively weigh the contribution of each task during training, allowing for flexible balancing based on task difficulty or the amount of available data. (iii) Cross-task attention is a lightweight module that can leverage source-target pairwise similarity to refine task-specific features. (iv) Compared with direct feature fusion, the addition of attention mechanisms can lead to over-parameterization if not managed carefully, where the model has more parameters than necessary, complicating the learning process and increasing the risk of overfitting on the tasks with limited data.

Feature Filtering. Multi-Task Guided Prediction-And-Distillation Network (PAD-Net) (xu2018pad) utilizes the predictions from hierarchical auxiliary tasks as multi-modal inputs to distill knowledge for the final tasks. As shown in Fig. 7(i), PAD-Net uses a hard-parameter-sharing encoder to extract common feature maps for the different tasks, and then the decoder of each auxiliary task generates intermediate predictions for the multi-modal distillation. The source paper proposes three distillation modules to incorporate useful multi-modal information for the final tasks. Suppose the feature maps from the $s$-th task at the $l$-th layer are denoted as $\mathcal{X}^{s}_{l}\in\mathbb{R}^{H\times W\times C}$, $s=1,\cdots,T$, which are transformed from the predictions of the $s$-th task via convolutional layers. The output feature maps for the $t$-th task after the multi-modal distillation are represented as $\mathcal{X}^{o,t}_{l+1}$.

The first way to perform cross-modal distillation is a naïve concatenation, $\mathcal{X}^{o}_{l+1}=[\mathcal{X}^{1}_{l},\cdots,\mathcal{X}^{T}_{l}]\in\mathbb{R}^{H\times W\times TC}$, which is then fed into the separate decoders for each task. In contrast, the second way refines the feature $\mathcal{X}^{t}_{l}$ by passing knowledge from the other tasks as below:

(54) $$\mathcal{X}^{o,t}_{l+1}=\mathcal{X}^{t}_{l}+\sum\nolimits_{s\neq t}^{T}\text{CONV}_{\mathcal{W}^{s\rightarrow t}}(\mathcal{X}^{s}_{l}),\quad t=1,\cdots,T,$$

where $\mathcal{W}^{s\rightarrow t}$ denotes the weight tensor of the convolution that maps the $s$-th task to the $t$-th task. Furthermore, the third way utilizes the sigmoid function to filter the passed knowledge by learning an attention map $\mathbf{G}^{t}$ for the $t$-th task as follows:

(55) $$\mathbf{G}^{t}=\sigma(\text{CONV}_{\mathcal{W}^{t}}(\mathcal{X}^{t}_{l})),\quad t=1,\cdots,T.$$

Then the knowledge is filtered via this attention map as follows:

(56) $$\mathcal{X}^{o,t}_{l+1}=\mathcal{X}^{t}_{l}+\sum\nolimits_{s\neq t}^{T}\mathbf{G}^{t}\odot\text{CONV}_{\mathcal{W}^{s\rightarrow t}}(\mathcal{X}^{s}_{l}),\quad t=1,\cdots,T.$$

After the multi-modal distillation, the distilled feature maps are up-sampled for the final pixel-level prediction tasks.
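The three PAD-Net distillation modules can be sketched in a few lines of NumPy. This is an illustrative sketch under simplifying assumptions, not the authors' implementation: every CONV is reduced to a 1x1 convolution (a per-pixel linear map), and the helper names and the weight dictionaries `w_cross` (keyed by `(s, t)`) and `w_gate` (keyed by `t`) are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution stand-in: x is (H, W, C_in), w is (C_in, C_out)."""
    return np.einsum("hwc,cd->hwd", x, w)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distill_concat(feats):
    """First module: concatenate all task features along channels -> (H, W, T*C)."""
    return np.concatenate(feats, axis=-1)

def distill_message_passing(feats, w_cross, t):
    """Second module, Eq. (54): add the convolved features of every task s != t."""
    out = feats[t].copy()
    for s, x_s in enumerate(feats):
        if s != t:
            out += conv1x1(x_s, w_cross[(s, t)])
    return out

def distill_attention_guided(feats, w_cross, w_gate, t):
    """Third module, Eqs. (55)-(56): gate each incoming message with G^t."""
    gate = sigmoid(conv1x1(feats[t], w_gate[t]))          # Eq. (55)
    out = feats[t].copy()
    for s, x_s in enumerate(feats):
        if s != t:
            out += gate * conv1x1(x_s, w_cross[(s, t)])   # Eq. (56)
    return out
```

Note that the gate $\mathbf{G}^{t}$ is computed from the target task's own features, so it decides per position and channel how much cross-task information the target task admits.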

Multi-Task Attention Network (MTAN) (liu2019end) presents a novel MTL architecture based on task-specific feature-wise attention, while global features are shared across different tasks. Suppose the shared global features at the $l$-th layer are denoted by $\mathcal{X}_{l}$, and the features learned for task $t$ are denoted by $\mathcal{X}_{l}^{t}$. Then the feature-wise attention on the global feature pool is computed as follows:

(57) $$\mathcal{X}_{l+1}^{t}=\sigma(\mathcal{X}_{l}^{t})\odot\mathcal{X}_{l},$$

where $\mathcal{X}_{l+1}^{t}$ is then concatenated with the features from the global pool again and fed into the task-specific convolution blocks. The attention map $\sigma(\mathcal{X}_{l}^{t})$ is learned in an end-to-end fashion: the sigmoid itself is a parameter-free activation function, so the mask is shaped entirely by the task-specific features $\mathcal{X}_{l}^{t}$.
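Eq. (57) amounts to an element-wise soft mask on the shared features. The sketch below shows this in NumPy; the function names are hypothetical, and the assumption that the attended features are concatenated with the shared features of the following block (`shared_next`) is one reading of the description above rather than a confirmed detail.

```python
import numpy as np

def mtan_attend(shared, task_feat):
    """Eq. (57): sigmoid mask from task features, applied element-wise to shared features."""
    mask = 1.0 / (1.0 + np.exp(-task_feat))   # values in (0, 1), same shape as shared
    return mask * shared

def next_task_input(attended, shared_next):
    """Assumed merge step: channel-wise concatenation with shared features."""
    return np.concatenate([attended, shared_next], axis=-1)
```

Because the mask lies in (0, 1), each task can only attenuate shared features, never amplify them, which keeps the task branches lightweight selectors over a common pool.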

To make the learning process more balanced between different tasks, liu2019end also suggests a simple yet effective Dynamic Weight Average (DWA) strategy (see § 2.2.5) to adjust losses according to their magnitudes in different epochs.

Multi-Scale Task Interaction Networks (MTI-Net) (vandenhende2020mti) aggregates multi-modal features at different scales from the decoder. As shown in Fig. 7(k), features at each scale are transformed and distilled by the feature propagation module and multi-modal distillation, respectively. This allows the model to capture task interactions at multiple scales. Because the higher-resolution scales have a limited receptive field, their task-related features tend to be of low quality. This motivates the Feature Propagation Module (FPM), which is inspired by the simple upsampling and passing of task-related features from lower scales to higher scales (ronneberger2015u). In this manner, features from different tasks at each scale are harmonized via traditional convolutions and activation functions. To obtain task-attentive features, a Sigmoid function along the task dimension is inserted to generate a task attention mask. To remedy negative transfer among unrelated tasks, a per-task channel gating mechanism, i.e., the Squeeze-and-Excitation (SE) module (hu2018squeeze), is used to refine the shared representations.

Furthermore, suppose the feature maps for task $s$ at scale $l\in\{1/4,1/8,1/16,1/32\}$ are represented by $\mathcal{X}^{s}_{l}$, $s=1,\cdots,T$. Then the per-scale multi-modal distillation process for task $t$ is repeated as follows:

(58) $$\mathcal{X}^{t}_{l}=\mathcal{X}^{t}_{l}+\sum\nolimits_{s\neq t}\sigma(\text{CONV}_{\mathcal{W}^{s\rightarrow t}_{l}}(\mathcal{X}^{s}_{l}))\,\text{CONV}_{\hat{\mathcal{W}}^{s\rightarrow t}_{l}}(\mathcal{X}^{s}_{l}),\quad t=1,\cdots,T,$$

where the Sigmoid function $\sigma$ produces a spatial-wise attention mask to filter the features at different scales. $\mathcal{W}^{s\rightarrow t}_{l}$ and $\hat{\mathcal{W}}^{s\rightarrow t}_{l}$ denote the weights to map features before attention. The FPM and multi-scale multi-modal distillation result in distilled cross-task features at every scale, which are then fed into the final aggregation module. The predictions are based on decoding these final representations via a task-specific head for each task.
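The per-scale distillation of Eq. (58) can be sketched as follows. Unlike the PAD-Net gate, the mask here is computed from the source task's features with its own weights. The sketch makes several assumptions: convolutions are reduced to 1x1 maps, the product between the mask and the message is taken element-wise, and the names `w_mask`, `w_msg`, and `pyramid` (a dictionary from scale to per-task feature lists) are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution stand-in: x is (H, W, C_in), w is (C_in, C_out)."""
    return np.einsum("hwc,cd->hwd", x, w)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distill_at_scale(feats, w_mask, w_msg, t):
    """Eq. (58) at one scale: add masked messages from every source task s != t."""
    out = feats[t].copy()
    for s, x_s in enumerate(feats):
        if s == t:
            continue
        mask = sigmoid(conv1x1(x_s, w_mask[(s, t)]))   # attention mask from the source
        out += mask * conv1x1(x_s, w_msg[(s, t)])      # filtered message to the target
    return out

def distill_pyramid(pyramid, w_mask, w_msg, t):
    """Repeat the distillation independently at every scale of the feature pyramid."""
    return {
        scale: distill_at_scale(feats, w_mask[scale], w_msg[scale], t)
        for scale, feats in pyramid.items()
    }
```

Each scale has its own weights, so the strength of a task interaction is free to differ between coarse and fine resolutions.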

Feature Diffusion. Pattern-Affinitive Propagation (PAP) (zhang2019pattern) builds a cross-task affinity matrix based on a spatial-wise attention mechanism and then iteratively diffuses features on each of the tasks to refine affinitive patterns among tasks. The detailed architecture is shown in Fig. 7(l). Suppose the feature maps used to compute the task-specific affinity matrix are denoted by $\mathcal{X}^{t}_{l}\in\mathbb{R}^{H\times W\times C}$. The affinity matrix for each task is computed as the inner product between each pair of spatial feature vectors of length $C$:

(59) $$\boldsymbol{X}^{t}_{l}=\text{RESHAPE}(\mathcal{X}^{t}_{l})\in\mathbb{R}^{HW\times C},\quad\boldsymbol{M}^{t}=\boldsymbol{X}^{t}_{l}{\boldsymbol{X}^{t}_{l}}^{\top}\in\mathbb{R}^{HW\times HW},\quad t=1,\cdots,T,$$

where $\text{RESHAPE}(\cdot)$ flattens the spatial dimensions while preserving the channel dimension. If the affinity matrix of each task is weighted by a learnable parameter $\alpha_{t}^{s}$ ($t=1,\cdots,T$, with $\sum\nolimits_{t=1}^{T}\alpha_{t}^{s}=1$), then the final affinity matrix for the target task $s$ can be adaptively combined as follows:

(60) $$\hat{\boldsymbol{M}}^{s}=\sum\nolimits_{t=1}^{T}\alpha_{t}^{s}\boldsymbol{M}^{t},\quad s=1,\cdots,T,$$

which is an adaptive combination process that can propagate the cross-task affinitive patterns for the target $s$-th task. Furthermore, the cross-task affinitive patterns are used to iteratively diffuse features for each task:

(61) $$\boldsymbol{X}^{t}_{l}(i+1)=\hat{\boldsymbol{M}}^{t}\cdot\boldsymbol{X}^{t}_{l}(i),\quad t=1,\cdots,T,\quad i=0,1,\cdots,i_{\text{max}},$$

where $i$ denotes the diffusion step. In general, a multi-step iterative diffusion process propagates the affinity information more effectively than a single step. Letting $i_{\text{max}}$ denote the maximum number of steps, the feature maps in the next layer are finally computed as follows:

(62) $$\boldsymbol{X}^{t}_{l+1}=\beta\cdot\boldsymbol{X}^{t}_{l}(i_{\text{max}})+(1-\beta)\cdot\boldsymbol{X}^{t}_{l}(0),\quad\mathcal{X}^{t}_{l+1}=\text{RESHAPE}(\boldsymbol{X}^{t}_{l+1})\in\mathbb{R}^{H\times W\times C},\quad t=1,\cdots,T,$$

where $\beta$ is a hyper-parameter that controls the feature consistency by weighting the diffused features against the original ones.
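The PAP pipeline of Eqs. (59)-(62) can be sketched compactly in NumPy. This is a sketch under stated assumptions rather than the reference implementation: the combination weights are passed in as a fixed nested list `alpha`, where `alpha[t][s]` is the weight of source task `s` for target task `t` (in the method they are learnable), and the function names are hypothetical.

```python
import numpy as np

def affinity(feat):
    """Eq. (59): flatten (H, W, C) to (HW, C) and take pairwise inner products."""
    x = feat.reshape(-1, feat.shape[-1])
    return x @ x.T                                   # (HW, HW)

def pap_layer(feats, alpha, t, beta=0.5, steps=3):
    """Diffuse the features of target task t with the combined cross-task affinity."""
    h, w, c = feats[t].shape
    mats = [affinity(f) for f in feats]
    # Eq. (60): weighted combination of the per-task affinity matrices.
    m_hat = sum(a * m for a, m in zip(alpha[t], mats))
    x0 = feats[t].reshape(h * w, c)
    x = x0
    for _ in range(steps):                           # Eq. (61): iterative diffusion
        x = m_hat @ x
    x_next = beta * x + (1.0 - beta) * x0            # Eq. (62): blend with the input
    return x_next.reshape(h, w, c)
```

With raw inner products the diffused features can grow quickly over several steps, so a practical implementation would likely normalize the affinity matrix (for example row-wise) before diffusion; that step is left out here because the equations above do not include it.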

Pattern-Structure Diffusion (PSD) (zhou2020pattern) utilizes a shared CNN encoder to extract feature maps that are fed into the task-specific decoders, where the pattern structures are distilled both within each task (intra-task) and across tasks (inter-task). As shown in Fig. 7(m), the intra-task PSD transmits pattern structures within each task to enhance the task-specific patterns and is then connected with the inter-task PSD to correlate pattern structures across different tasks. Without loss of generality, we denote an $l\times l$ patch cropped at each position of the feature maps $\mathcal{X}\in\mathbb{R}^{H\times W\times C}$ as $\mathcal{X}_{P_{i}}\in\mathbb{R}^{l\times l\times C}$, where $P_{i}$ means the pattern at position $i$. Then the pattern structure can be defined from the KNN graph on the $l\times l$ points within $\mathcal{X}_{P_{i}}$ as follows:

(63) $$[\boldsymbol{A}_{P_{i}}]_{j,k}=\exp\{-\|\text{RESHAPE}(\mathcal{X}_{P_{i}})_{j}-\text{RESHAPE}(\mathcal{X}_{P_{i}})_{k}\|_{2}^{2}/\tau^{2}\},\quad i=1,\cdots,HW,\quad j,k=1,\cdots,l^{2},$$

where $\tau$ is a fixed hyper-parameter set by the user. To make pattern structures at different scales comparable, $\boldsymbol{A}_{P_{i}}$ is further normalized as follows:

(64) $$\boldsymbol{A}_{P_{i}}\leftarrow\boldsymbol{A}_{P_{i}}/(\boldsymbol{1}^{\top}\boldsymbol{A}_{P_{i}}\boldsymbol{1}).$$

Then the intra-task PSD can be formulated as a recursive process:

(65) $$[\text{RESHAPE}(\mathcal{X}^{i+1})]_{j}=[\text{RESHAPE}(\mathcal{X}^{i})]_{j}+\beta\sum\nolimits_{k\in\mathcal{N}(v_{j})}\boldsymbol{A}_{j,k}\times[\text{RESHAPE}(\mathcal{X}^{i})]_{k},$$

where $\boldsymbol{A}$ denotes the pattern structure of the whole feature map, $\mathcal{N}(v_{j})$ is the neighbor set of the target pixel $v_{j}$, and $\beta$ is a fixed hyper-parameter to control the residual connection. The iteration above runs for multiple steps so that each local pattern spreads into distant regions, which makes it a diffusion process.

To achieve cross-task pattern-structure propagation, inter-task PSD transfers the patterns from other tasks as follows:

(66) $$\begin{aligned}{}[\text{RESHAPE}(\mathcal{X}^{(t)})]_{j}={}&[\text{RESHAPE}(\mathcal{X}^{(t)})]_{j}+\sum\nolimits_{s\neq t}\sum\nolimits_{k\in\mathcal{N}(v_{j})}\beta_{s\rightarrow t}\boldsymbol{A}_{j,k}^{s\rightarrow t}\times[\text{RESHAPE}(\mathcal{X}^{(t)})]_{k},\\
\text{s.t. }&\boldsymbol{A}_{P_{i}}^{s\rightarrow t}=\boldsymbol{A}_{P_{i}}^{(t)}\odot\boldsymbol{A}_{P_{i}}^{(s)}/[\boldsymbol{1}^{\top}(\boldsymbol{A}_{P_{i}}^{(t)}\odot\boldsymbol{A}_{P_{i}}^{(s)})\boldsymbol{1}],\quad s,t=1,\cdots,T,\end{aligned}$$

where $\{\boldsymbol{A}_{P_{i}}^{s\rightarrow t}\}_{s\neq t}$ represent the pattern structures transferred from task $s$ to the target task $t$. In this manner, the PSD method distills feature similarity across different tasks.
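To make Eqs. (63)-(66) concrete, the sketch below works on a single $l\times l$ patch per task. It is a simplified illustration under explicit assumptions: how the per-patch structures are assembled into the whole-map structure $\boldsymbol{A}$ is not modeled, every point in the patch (including the point itself) is treated as a neighbor, and all function names and the `betas` dictionary keyed by `(s, t)` are hypothetical.

```python
import numpy as np

def pattern_structure(patch, tau=1.0):
    """Eqs. (63)-(64): Gaussian affinity between the l*l points of one patch."""
    pts = patch.reshape(-1, patch.shape[-1])             # (l*l, C)
    diff = pts[:, None, :] - pts[None, :, :]
    a = np.exp(-np.sum(diff ** 2, axis=-1) / tau ** 2)   # Eq. (63)
    return a / a.sum()                                   # Eq. (64): entries sum to 1

def intra_task_diffuse(patch, a, beta=0.1, steps=2):
    """Eq. (65), restricted to one patch with all points as neighbors."""
    pts = patch.reshape(-1, patch.shape[-1])
    for _ in range(steps):
        pts = pts + beta * (a @ pts)                     # residual diffusion step
    return pts.reshape(patch.shape)

def cross_task_structure(a_t, a_s):
    """Constraint in Eq. (66): element-wise product of two structures, renormalized."""
    prod = a_t * a_s
    return prod / prod.sum()

def inter_task_diffuse(patches, structures, t, betas):
    """Eq. (66): diffuse target features with structures transferred from tasks s != t."""
    pts = patches[t].reshape(-1, patches[t].shape[-1])
    out = pts.copy()
    for s in range(len(patches)):
        if s != t:
            a_st = cross_task_structure(structures[t], structures[s])
            out += betas[(s, t)] * (a_st @ pts)
    return out.reshape(patches[t].shape)
```

The element-wise product in `cross_task_structure` keeps a connection strong only where both tasks agree that two points are similar, which is what lets the target task borrow structure without copying the source features themselves.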

Soft Attention. Attentive Single-Tasking of Multiple Tasks (ASTMT) (maninis2019attentive) points out a dilemma: information that is critical for one task could be a nuisance for another when multiple tasks are inferred together. ASTMT addresses it by single-tasking, a strategy that executes one task at a time instead of inferring all of them simultaneously. Technically, every task shares a backbone network in a hard manner but adapts its specificity with residual adapter (RA) branches, as shown in Fig. 7(n). Suppose the RA operation for the $t$-th task is represented by $RA_{t}$, and its original residual skip connection is $R$. Then the single-tasking process by RA is calculated as below:

$$\mathcal{X}_{l+1}^{t}=\mathcal{X}_{l}^{t}+R(\mathcal{X}_{l}^{t})+RA_{t}(\mathcal{X}_{l}^{t}),\quad t=1,\cdots,T,\tag{67}$$

where $R$ denotes the residual connection that is not influenced by the task. $RA_t$ can be naïve bottleneck convolutions or be replaced by an attentive block $SE_t$ (e.g., an SE-ResNet block (hu2018squeeze)). Because this adaptation alone fails to disentangle the shared and task-specific spaces, a GRadiEnt Adversarial Training (GREAT) process (sinha2018gradient) is introduced across tasks to ensure that the shared backbone learns shared representations and maintains this quality during the single-tasking process. More details of multi-task adversarial training are given in § 2.2.7.
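
The layer update in Eq. (67) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the helpers `shared_residual`, `depth_adapter`, and `seg_adapter` are hypothetical element-wise maps standing in for the convolutional branches $R$ and $RA_t$, and the task names are invented for the example.

```python
def residual_adapter_layer(x, residual, adapters, task):
    """One single-tasking layer: x + R(x) + RA_t(x), following Eq. (67)."""
    shared = residual(x)          # task-agnostic branch R
    specific = adapters[task](x)  # task-specific branch RA_t
    return [xi + si + ai for xi, si, ai in zip(x, shared, specific)]


# Hypothetical stand-ins for the real convolutional blocks.
def shared_residual(x):
    return [0.5 * v for v in x]


def depth_adapter(x):
    return [0.1 * v for v in x]


def seg_adapter(x):
    return [-0.2 * v for v in x]


adapters = {"depth": depth_adapter, "seg": seg_adapter}
x = [1.0, 2.0]

# The same input and shared branch yield different features per task,
# because only the adapter branch changes with the task being executed.
out_depth = residual_adapter_layer(x, shared_residual, adapters, "depth")
out_seg = residual_adapter_layer(x, shared_residual, adapters, "seg")
```

Running one task at a time means a forward pass only ever touches one entry of `adapters`, which is the single-tasking idea in miniature.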

Figure 12. The computational details of Context Pooling (CP).

Adaptive Task-Relational Context (ATRC) module (bruggemann2021exploring) uses global cross-task and local spatial-wise attention mechanisms to refine each task prediction. It is a general module that can be applied to any backbone across any supervised dense prediction tasks. The ATRC refinement begins with a hard-parameter-sharing encoder, on top of which each task head generates task-specific features $\mathcal{X}_t$ and auxiliary predictions $\mathcal{P}_t$, where $t=1,\cdots,T$. Specifically, the features $\mathcal{X}_t$ of each target task $\mathcal{T}_t$ are refined by attending to the features $\mathcal{X}_s$ of every available task $\mathcal{T}_s$, $s\in\{1,\cdots,T\}$, within a separate Context Pooling (CP) block. As shown in Fig. 7(o), the original features $\mathcal{X}_t$ and the refined features $\{\mathcal{X}_{s\rightarrow t}\}_{s=1}^{T}$ are combined to predict the target task $\mathcal{T}_t$.

There are three categories of context information (global context, local context, and label context) to be learned when refining features from the source task to the target task. A detailed illustration is given in Fig. 12. Each CP block accepts the features $\mathcal{X}_s,\mathcal{X}_t$ and predictions $\mathcal{P}_s,\mathcal{P}_t$ from the source task and target task, respectively. $\mathcal{X}_t$ and $\mathcal{X}_s$ are transformed into queries $\boldsymbol{Q}$, keys $\boldsymbol{K}$ and values $\boldsymbol{V}$ (flattening along the spatial dimension and preserving the channel dimension) as below:

$$\begin{aligned}\boldsymbol{Q}&=RESHAPE(CONV_{\mathcal{W}_q}(\mathcal{X}_t)),\quad\boldsymbol{K}=RESHAPE(CONV_{\mathcal{W}_k}(\mathcal{X}_s)),\\ \boldsymbol{V}&=RESHAPE(CONV_{\mathcal{W}_v}(\mathcal{X}_s)),\end{aligned}\tag{68}$$

where $CONV_{*}(\cdot)$ is a $1\times 1$ CONV-BN-ReLU operation, and $\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}\in\mathbb{R}^{HW\times C}$. In the global context attention, a target feature value $\boldsymbol{v}_i$ at position $i$ is substituted with

$$\boldsymbol{v}_i^{\prime}=\sum\nolimits_{j=1}^{L}\text{sim}(\boldsymbol{q}_i,\boldsymbol{k}_j)\boldsymbol{v}_j\Big/\sum\nolimits_{j=1}^{L}\text{sim}(\boldsymbol{q}_i,\boldsymbol{k}_j),\quad i=1,\cdots,L,\tag{69}$$

where $L$ denotes the total number of pixels (i.e., feature values) and $\text{sim}(\cdot,\cdot)$ denotes an arbitrary similarity function. For the local context attention, let $\mathcal{N}_p(i)$ denote the 2D spatial neighborhood of the target pixel at position $i$ with patch extent $p$; the spatial-wise local attention is then formulated as below:

$$\boldsymbol{v}_i^{\prime}=\sum\nolimits_{j\in\mathcal{N}_p(i)}\text{softmax}(\boldsymbol{q}_i\boldsymbol{k}_j/\sqrt{C})\boldsymbol{v}_j,\quad i=1,\cdots,L,\tag{70}$$

where $C$ is the channel dimension of $\boldsymbol{K}$. The $T$-label context and $S$-label context are defined in a label space that is partitioned into a set of disjoint label regions. The aim is to find a prototypical representation for each pixel. Suppose $\mathcal{P}_t\in\mathbb{R}^{HW\times R_t}$, where each entry of the last dimension indicates the degree to which a pixel belongs to a label region $r\in\{1,\cdots,R_t\}$. For the $T$-label context, the keys $\boldsymbol{K}$ and values $\boldsymbol{V}$ are calculated via the region prototypes as below:

$$\boldsymbol{K}=CONV_{\mathcal{W}_k}(\hat{\mathcal{P}}_t^{\top}RESHAPE(\mathcal{X}_s)),\quad\boldsymbol{V}=CONV_{\mathcal{W}_v}(\hat{\mathcal{P}}_t^{\top}RESHAPE(\mathcal{X}_s)),\tag{71}$$

where $\hat{\mathcal{P}}_t$ denotes $\mathcal{P}_t$ after softmax normalization over the spatial dimension, and the matrix $\hat{\mathcal{P}}_t^{\top}RESHAPE(\mathcal{X}_s)\in\mathbb{R}^{R_t\times C}$ represents the region prototypes. Alternatively, $\mathcal{P}_t$ is substituted with the source task prediction maps $\mathcal{P}_s$ in the $S$-label context. The outputs of both are attention-weighted combinations of features $\boldsymbol{v}$:

$$\boldsymbol{v}^{\prime}=\text{softmax}(\boldsymbol{q}\boldsymbol{k}^{\top}/\sqrt{C})\boldsymbol{v}.\tag{72}$$
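
As a rough illustration of the global and label contexts, the sketch below implements Eq. (69) with the particular choice $\text{sim}(\boldsymbol{q},\boldsymbol{k})=\exp(\boldsymbol{q}\boldsymbol{k}^{\top}/\sqrt{C})$, which reduces it to the softmax attention of Eq. (72), and the region prototypes $\hat{\mathcal{P}}_t^{\top}RESHAPE(\mathcal{X}_s)$ of Eq. (71). It makes simplifying assumptions that are not part of ATRC itself: the $1\times 1$ CONV-BN-ReLU projections are replaced by the identity, features are already flattened to $HW\times C$ lists, and all function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Row-wise scaled dot-product attention, as in Eq. (72).

    With keys and values taken from all L positions, this equals Eq. (69)
    for the choice sim(q, k) = exp(q . k / sqrt(C)).
    """
    scale = math.sqrt(len(keys[0]))
    refined = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        refined.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return refined


def region_prototypes(pred, feats):
    """Compute P_hat^T X: one C-dimensional prototype per label region.

    pred is HW x R (region scores per pixel), feats is HW x C.
    The softmax runs over the spatial axis, separately for each region.
    """
    prototypes = []
    for r in range(len(pred[0])):
        weights = softmax([row[r] for row in pred])
        prototypes.append([
            sum(w * f[c] for w, f in zip(weights, feats))
            for c in range(len(feats[0]))
        ])
    return prototypes


# Toy data (hypothetical): HW = 4 pixels, C = 2 channels, R = 2 label regions.
x_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # target features
x_s = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.0, 0.0]]  # source features
p_t = [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0], [0.0, 0.0]]  # target predictions

global_refined = attend(x_t, x_s, x_s)       # global context over all pixels
protos = region_prototypes(p_t, x_s)         # T-label context prototypes
label_refined = attend(x_t, protos, protos)  # attention over R prototypes
```

The label context attends over only $R_t$ prototypes rather than all $HW$ positions, which is why its attention map is much smaller than the global one. Passing the source predictions instead of `p_t` to `region_prototypes` would give the $S$-label variant.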

Deformable Mixer Transformers (DeMT) (zhang2023demt) is an encoder-decoder architecture that combines the merits of deformable CNNs (dai2017deformable; zhu2019deformable) and the attention-based ViT (dosovitskiy2021an) to model multiple tasks; the details are shown in Fig. 7(p). The encoder, referred to as the deformable mixer in zhang2023demt, mixes features across channels through $1\times 1$ convolutions and captures deformable spatial features through learnable offsets. After the encoder learns task-specific features, the task-aware transformer decoder first applies task interactions based on the attention mechanism (MHSA + MLP) and then uses a task query block to decode task-aware features for each task. Suppose the transformer operator inside the task interaction block can be abstracted as

$$\mathcal{X}_{l+1}=MHSA_{inter}(q=LN(\mathcal{X}_l),k=LN(\mathcal{X}_l),v=LN(\mathcal{X}_l)),\tag{73}$$

where $LN$ denotes layer normalization applied to the fused feature $\mathcal{X}_l$, and the subscripts $l$ and $l+1$ denote the feature index before and after the task interaction block, respectively. To decode task awareness in the task query block, another transformer uses as its query the task-specific feature taken before $MHSA_{inter}$ (i.e., $\mathcal{X}_l^t$):

$$\mathcal{X}_{l+2}^{t}=MHSA_{query}(q=LN(\mathcal{X}_l^t),k=LN(\mathcal{X}_{l+1}),v=LN(\mathcal{X}_{l+1})),\quad t=1,\cdots,T,\tag{74}$$

where the subscript $l+2$ denotes the feature index after the task query block.

Remarks (i) Knowledge distillation can utilize and transfer interpretable patterns across multiple tasks, resulting in meaningful principles that can provide guidance for architectural design. (ii) Knowledge distillation has the capability to aggregate refined features from multiple tasks at various scales, thereby enhancing task generalization ability and significantly improving performance. (iii) Knowledge distillation allows for the creation of smaller and more efficient student models on target tasks. The distilled knowledge helps compress the complex teacher model into a more lightweight student model while retaining a comparable level of performance. (iv) Knowledge distillation enables the transfer of knowledge across tasks, even if they are different or loosely related. This flexibility allows for leveraging insights from related tasks to enhance the learning process, resulting in better performance on each individual task. (v) The overall performance heavily depends on the quality and capabilities of the teacher model. If the teacher model is not well-trained or lacks expertise in the specific tasks, the knowledge distillation process may not be effective, limiting the potential benefits. (vi) Implementing knowledge distillation adds extra computational complexity that often involves the processes of training, transferring, and fine-tuning, thus inevitably being time-consuming and resource-intensive.

2.2.5. Scalarization Approach.

One of the most popular methods to solve multi-task learning problems is the scalarization approach, which formulates the problem as a linear combination of loss functions of different tasks (kendall2018multi; liu2019end; chen2018gradnorm; Senushkin_2023_CVPR) as

$$\min_{\boldsymbol{W}}\mathcal{L}_{\text{total}}(\boldsymbol{W})=\sum_{t=1}^{T}\alpha^{(t)}\mathcal{L}^{(t)}(\boldsymbol{W}),\tag{75}$$

where $\{\alpha^{(t)}\}_{t=1}^{T}\subset\mathbb{R}_{+}$ are the task weights, which encode preferences over different tasks, $\boldsymbol{W}$ denotes the model parameters, and $\{\mathcal{L}^{(t)}\}_{t=1}^{T}$ are the loss functions of the different tasks. In each loss function $\mathcal{L}^{(t)}$, we drop the dependency on the training samples $\{\boldsymbol{X}^{t},\boldsymbol{y}^{t}\}$ to avoid cluttered notation.

Gradient-based methods are perhaps the most popular choice for solving Eq. (75). Their update rule for $\boldsymbol{W}$ takes the form $\boldsymbol{W}\leftarrow\boldsymbol{W}+\eta\boldsymbol{d}$, where $\eta>0$ is the learning rate and $\boldsymbol{d}$ is the search direction. $\boldsymbol{d}$ is a function of $\{\alpha^{(t)},\nabla\mathcal{L}^{(t)}\}_{t=1}^{T}$, for example, $\boldsymbol{d}=-\sum_{t=1}^{T}\alpha^{(t)}\nabla\mathcal{L}^{(t)}(\boldsymbol{W})$. Aside from the challenge of choosing a proper learning rate $\eta$, there are two additional challenges, dominant gradients and conflicting gradients; see Fig. 13 for an illustration. The dominant gradients issue occurs when the gradient norms of some tasks' losses are significantly larger than those of the others, so that the update direction $\boldsymbol{d}$ is biased towards the tasks with larger gradient norms. The conflicting gradients issue arises when progress on one task degrades the performance of another task.
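
Both issues can be reproduced with two hand-picked gradient vectors. The snippet below is a minimal sketch, not taken from any of the cited works: the gradients `g1` and `g2` are invented values, and the first-order change of each loss along $\boldsymbol{d}$ is approximated by the inner product of its gradient with $\boldsymbol{d}$.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_direction(grads, alphas):
    """Return d = -sum_t alpha_t * grad_t for the scalarized loss of Eq. (75)."""
    dim = len(grads[0])
    return [-sum(a * g[k] for a, g in zip(alphas, grads)) for k in range(dim)]


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


# Hypothetical gradients of two task losses at the current W.
g1 = [10.0, 0.0]   # large norm
g2 = [-1.0, 1.0]   # small norm, partly opposed to g1
d = search_direction([g1, g2], alphas=[1.0, 1.0])

# First-order change of each loss along d (negative means the loss decreases).
change_task1 = dot(g1, d)
change_task2 = dot(g2, d)

# How closely d follows the steepest-descent direction of task 1 alone.
alignment_with_task1 = cosine(d, [-g for g in g1])
```

Here `d` is almost parallel to the negative gradient of task 1 (dominance), and `change_task2` is positive, so a small step along `d` increases the loss of task 2 to first order (conflict).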

Figure 13. (a) Dominant gradients issue: the update direction $\boldsymbol{d}$ is dominated by the negative gradient of the loss of task 1. (b) Conflicting gradients issue: when $\{\alpha^{(t)}\}_{t=1}^{3}$ are not properly set, the update direction $\boldsymbol{d}$ can decrease the losses of tasks 1 and 3 while increasing the loss of task 2. Therefore, the performance on task 2 is compromised.

In the remainder of this section, we review works that follow different philosophies to address the dominant and conflicting gradients challenges. These methods can be roughly characterized as gradient correction approaches, where transformations are applied to the gradients to address the conflicting gradients issue, and dynamic weighting approaches, where $\{\alpha^{(t)}\}$ are updated in each iteration to address the dominant gradients issue.

Figure 14. Demonstration of the gradient projection technique used in yu2020gradient. (a) Conflicting. (b) Non-conflicting. (c) Projecting $\boldsymbol{g}_i$ to $\boldsymbol{n}_j$. (d) Projecting $\boldsymbol{g}_j$ to $\boldsymbol{n}_i$.

Gradient Correction. Projecting Conflicting Gradients (PCGrad) (yu2020gradient) proposes to mitigate the conflicting gradients issue by projecting each conflicting gradient onto the subspace orthogonal to the gradient it conflicts with. Formally, PCGrad (yu2020gradient) defines two gradients $(g_i,g_j)$ to be conflicting if $g_i^{T}g_j<0$. To address this issue, instead of forming the search direction as $\boldsymbol{d}=-(g_i+g_j)$, PCGrad suggests using $\boldsymbol{d}=-(\textbf{Proj}_{n_j}(g_i)+\textbf{Proj}_{n_i}(g_j))$, where $n_i^{T}g_i=0$, $n_j^{T}g_j=0$, and $\textbf{Proj}$ is the Euclidean projection operator, e.g., $\textbf{Proj}_{n_j}(g_i)=g_i-\frac{g_i^{T}g_j}{\|g_j\|^{2}}g_j$. See Fig. 14 for an illustration. From the perspective of multi-objective optimization (which will be discussed in the next section), this method is a particular way of choosing a common descent direction. Gradient sign Dropout (GradDrop) (chen2020just) attributes conflicts to differences in the signs of gradients along each coordinate direction. Motivated by dropout, it proposes a probabilistic masking procedure that keeps only gradients with consistent signs in each update. Conflict-Averse Gradient descent (CAGrad) (liu2021conflictaverse) proposes to mitigate gradient conflicts by solving the problem

$$\max_{\boldsymbol{d}}\min_{t\in[T]}\nabla\mathcal{L}^{(t)}(\boldsymbol{W})^{T}(-\boldsymbol{d})\quad\text{s.t.}\quad\left\|\boldsymbol{d}-\nabla\mathcal{L}_{\text{total}}(\boldsymbol{W})\right\|\leq c\left\|\nabla\mathcal{L}_{\text{total}}(\boldsymbol{W})\right\|,\tag{76}$$

where $c>0$ is a prescribed parameter. The intuition is that $-\min_{t\in[T]}\nabla\mathcal{L}^{(t)}(\boldsymbol{W})^{T}\boldsymbol{d}$ can be used as an approximate measure of the conflict among objectives, and one wants to find the direction $\boldsymbol{d}$ that minimizes this conflict while staying close to the original negative gradient of $\mathcal{L}_{\text{total}}(\boldsymbol{W})$.
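
Returning to the PCGrad projection described earlier in this part, a minimal sketch for plain Python vectors is given below. It assumes each gradient is projected against the original gradients of the other tasks and, for reproducibility, visits the other tasks in a fixed order, whereas yu2020gradient samples the order at random. The function names and the two toy gradients are illustrative only.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(g, ref):
    """Project g onto the normal plane of ref: g - (g.ref / ||ref||^2) ref."""
    coef = dot(g, ref) / dot(ref, ref)
    return [gi - coef * ri for gi, ri in zip(g, ref)]


def pcgrad_direction(grads):
    """Return d = -sum_i g_i^PC, projecting only where a pair conflicts."""
    projected = []
    for i, g in enumerate(grads):
        g_pc = list(g)
        for j, other in enumerate(grads):
            # A pair conflicts when the inner product is negative.
            if i != j and dot(g_pc, other) < 0:
                g_pc = project_out(g_pc, other)
        projected.append(g_pc)
    dim = len(grads[0])
    return [-sum(g[k] for g in projected) for k in range(dim)]


g_i = [1.0, 0.0]
g_j = [-1.0, 1.0]            # g_i . g_j = -1 < 0, so the pair conflicts
d = pcgrad_direction([g_i, g_j])

naive = [-(a + b) for a, b in zip(g_i, g_j)]   # plain summed direction
```

In this toy case the plain sum gives a direction with zero first-order effect on task $i$, while the projected direction has a negative inner product with both `g_i` and `g_j`, so both losses decrease to first order.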

Reducing conflicting gradient (Recon) (shi2023recon) empirically observes that PCGrad, GradDrop, and CAGrad (yu2020gradient; chen2020just; liu2021conflictaverse) only slightly reduce the occurrence of conflicting gradients in some cases (compared to joint training, i.e., the case where $\alpha^{(t)}=1$ for all $t\in[T]$ in Eq. (75)), and in other cases they even increase it. Therefore, Recon proposes to analyze parameters in a layer-wise fashion to pinpoint the shared parameters that are most likely to incur conflicting gradients. Concretely, let $(g_i^{k},g_j^{k})$ be the gradients of the task pair $(i,j)$ with respect to the parameters of the $k$-th layer. $(g_i^{k},g_j^{k})$ is said to be $S$-conflicting if $s^{k}:=\frac{\langle g_i^{k},g_j^{k}\rangle}{\|g_i^{k}\|\|g_j^{k}\|}<S$ for a given threshold $S\in[-1,0)$. Recon first trains the model for $E$ epochs with any gradient-based method, e.g., PCGrad, GradDrop, or CAGrad, and then derives the conflicting scores of each layer over the $E$ epochs to identify the top $K$ layers with the highest (most negative) conflicting scores. Finally, Recon turns the parameters of these $K$ layers into task-specific parameters and retrains the network from scratch. As pointed out in shi2023recon, while Recon is sensitive to the parameters $K$ and $S$, one only needs to tune them once for a given network architecture.
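
The layer-selection step can be sketched as follows. This is an illustration under assumptions rather than the procedure of shi2023recon: the per-layer score is taken to be the number of recorded steps at which a task pair is $S$-conflicting, the gradient history is a hand-written toy, and the layer names and helper functions are hypothetical.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def conflict_counts(grad_history, threshold):
    """Count, per layer, how often the task pair is S-conflicting.

    grad_history is a list of steps; each step maps a layer name to the
    pair (g_i, g_j) of task gradients for that layer's parameters.
    """
    counts = {}
    for step in grad_history:
        for layer, (g_i, g_j) in step.items():
            counts.setdefault(layer, 0)
            if cosine(g_i, g_j) < threshold:
                counts[layer] += 1
    return counts


def top_k_layers(counts, k):
    """Layers with the most S-conflicting occurrences come first."""
    return sorted(counts, key=counts.get, reverse=True)[:k]


# Toy history (hypothetical): two recorded steps, three shared layers.
history = [
    {"enc1": ([1.0, 0.0], [-1.0, 0.1]),
     "enc2": ([1.0, 0.0], [1.0, 1.0]),
     "enc3": ([0.0, 1.0], [1.0, -1.0])},
    {"enc1": ([1.0, 1.0], [-1.0, -1.0]),
     "enc2": ([1.0, 0.0], [0.0, 1.0]),
     "enc3": ([1.0, 0.0], [1.0, 0.5])},
]

counts = conflict_counts(history, threshold=-0.1)
to_split = top_k_layers(counts, k=2)
```

The layers in `to_split` are the candidates that would be duplicated into task-specific copies before retraining from scratch.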

Dynamic weighting. GradNorm proposed in  chen2018gradnorm suggests to mitigate the dominant gradient issue so that gradients for each task have the proper magnitude. The strategy to adjust {α(t)}superscript𝛼𝑡\{\alpha^{(t)}\}{ italic_α start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT } is based on the average gradient norm of each task and the relative progress achieved for each task. With this information, GradNorm constructs a reference point at each iteration, {α(t)}superscript𝛼𝑡\{\alpha^{(t)}\}{ italic_α start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT } was then selected to minimize the 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT distance between the actual gradient of each task and the reference point. Concretely, let GN𝑾(t)(i)=𝑾α(t)(i)(t)(i)2𝐺superscriptsubscript𝑁𝑾𝑡𝑖subscriptnormsubscript𝑾superscript𝛼𝑡𝑖superscript𝑡𝑖2GN_{\boldsymbol{W}}^{(t)}(i)=\|\triangledown_{\boldsymbol{W}}\alpha^{(t)}(i)% \mathcal{L}^{(t)}(i)\|_{2}italic_G italic_N start_POSTSUBSCRIPT bold_italic_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) = ∥ ▽ start_POSTSUBSCRIPT bold_italic_W end_POSTSUBSCRIPT italic_α start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) caligraphic_L start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT be the measure of 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT norm of the t𝑡titalic_tth task’s weighted gradient at iteration i𝑖iitalic_i777We add addition index i𝑖iitalic_i to indicate their dependence on the iteration counter i𝑖iitalic_i.. 
Next, the averaged gradient norm across all tasks was calculated as GN¯𝑾(i)=𝔼t[GN𝑾(t)(i)]subscript¯𝐺𝑁𝑾𝑖subscript𝔼𝑡delimited-[]𝐺superscriptsubscript𝑁𝑾𝑡𝑖\overline{GN}_{\boldsymbol{W}}(i)=\mathbb{E}_{t}[GN_{\boldsymbol{W}}^{(t)}(i)]over¯ start_ARG italic_G italic_N end_ARG start_POSTSUBSCRIPT bold_italic_W end_POSTSUBSCRIPT ( italic_i ) = blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ italic_G italic_N start_POSTSUBSCRIPT bold_italic_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) ]. To measure the training progress of each task, ~(t)(i)=(t)(i)(t)(0)superscript~𝑡𝑖superscript𝑡𝑖superscript𝑡0\tilde{\mathcal{L}}^{(t)}(i)=\frac{\mathcal{L}^{(t)}(i)}{\mathcal{L}^{(t)}(0)}over~ start_ARG caligraphic_L end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) = divide start_ARG caligraphic_L start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) end_ARG start_ARG caligraphic_L start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( 0 ) end_ARG was introduced, which inversely proportional to the training rate. Lastly, the relative inverse training rate for task t𝑡titalic_t can be formulated as r(t)(i)=(t)(i)(t)(0)/𝔼t[~(t)(i)]superscript𝑟𝑡𝑖superscript𝑡𝑖superscript𝑡0subscript𝔼𝑡delimited-[]superscript~𝑡𝑖r^{(t)}(i)=\frac{\mathcal{L}^{(t)}(i)}{\mathcal{L}^{(t)}(0)}/\mathbb{E}_{t}[% \tilde{\mathcal{L}}^{(t)}(i)]italic_r start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) = divide start_ARG caligraphic_L start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) end_ARG start_ARG caligraphic_L start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( 0 ) end_ARG / blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ over~ start_ARG caligraphic_L end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ( italic_i ) ]. 
A higher value of $r^{(t)}(i)$ indicates that task $t$ has trained relatively slowly up to iteration $i$, so it is assigned a larger target gradient magnitude, which encourages task $t$ to learn more quickly. Finally, the weights $\{\alpha^{(t)}\}$ for the next iteration are determined by solving the following problem

$$\min_{\{\alpha^{(t)}\}_{t=1}^{T}}\sum\nolimits_{t=1}^{T}\left\|GN_{\boldsymbol{W}}^{(t)}(i)-\overline{GN}_{\boldsymbol{W}}(i)\cdot{[r^{(t)}(i)]}^{\zeta}\right\|_{1}, \tag{77}$$

where $\zeta$ is a hyperparameter introduced to avoid dramatically different learning dynamics between tasks caused by differing task complexity. Inspired by GradNorm, Dynamic Weight Averaging (DWA), proposed in liu2019end, is another strategy to balance the task-specific losses. The update of $\alpha^{(t)}(i)$ is defined as $\alpha^{(t)}(i)=\frac{\left(\sum_{s=1}^{T}\alpha^{(s)}(i)\right)e^{r^{(t)}(i-1)/T}}{\sum_{s=1}^{T}e^{r^{(s)}(i-1)/T}}$ and $r^{(t)}(i-1)=\frac{\mathcal{L}^{(t)}(i-1)}{\mathcal{L}^{(t)}(i-2)}$, where $r^{(t)}(i-1)$ is the relative progress of task $t$ at iteration $i-1$. Here the prefactor $\sum_{s}\alpha^{(s)}(i)$ fixes the total weight budget (in liu2019end, the $T$ in the exponent is a temperature hyperparameter that controls the softness of the weighting).
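As a rough numerical illustration of the two rules above, the following sketch evaluates the GradNorm objective of Eq. (77) for given gradient norms and computes DWA weights from two consecutive loss vectors. The function names, the default `zeta=1.5`, the separate `temperature` argument, and the default weight budget (the number of tasks) are illustrative assumptions rather than settings taken from the original papers; in GradNorm itself the objective is minimized over $\{\alpha^{(t)}\}$ by gradient steps, which this sketch does not do.

```python
import numpy as np


def gradnorm_objective(weighted_grad_norms, losses_now, losses_init, zeta=1.5):
    """Evaluate the l1 objective of Eq. (77) at one iteration.

    weighted_grad_norms[t] plays the role of GN_W^{(t)}(i); zeta=1.5 is an
    arbitrary illustrative default, not a recommended value.
    """
    gn = np.asarray(weighted_grad_norms, dtype=float)
    loss_ratio = np.asarray(losses_now, dtype=float) / np.asarray(losses_init, dtype=float)
    r = loss_ratio / loss_ratio.mean()        # relative inverse training rate r^{(t)}(i)
    target = gn.mean() * r ** zeta            # reference point for each task
    return float(np.abs(gn - target).sum()), target


def dwa_weights(losses_prev, losses_prev2, temperature=2.0, budget=None):
    """Dynamic Weight Averaging: softmax over loss ratios, rescaled to `budget`.

    `temperature` and `budget` are hypothetical argument names; the budget
    defaults to the number of tasks (an assumption of this sketch).
    """
    r = np.asarray(losses_prev, dtype=float) / np.asarray(losses_prev2, dtype=float)
    budget = float(len(r)) if budget is None else float(budget)
    z = r / temperature
    z = z - z.max()                           # shift for a numerically stable softmax
    soft = np.exp(z) / np.exp(z).sum()
    return budget * soft


# Example call: task 2 has stalled (ratio 1.0), so DWA gives it the larger weight.
objective, target = gradnorm_objective([1.0, 4.0], [0.5, 0.9], [1.0, 1.0])
weights = dwa_weights([0.5, 0.9], [0.6, 0.9])
```

Both functions treat the per-task losses and gradient norms as given numbers, so they can be dropped into any training loop that already tracks these quantities.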
Reinforced MTL (RMTL) (liu2018exploration, Chapter 3) adjusts $\{\alpha^{(t)}\}$ using a reinforcement learning strategy, and Loss-Balanced Task Weighting (LBTW) (liu2019loss) combines GradNorm and RMTL such that the weights $\{\alpha^{(t)}\}$ adapt to both samples and tasks. Impartial MTL (IMTL) (liu2021towards) proposes to update $\{\alpha^{(t)}\}$ in each iteration such that the aggregated gradient $\sum_{t=1}^{T}\alpha^{(t)}\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})$ has equal projections onto the raw gradients of individual tasks. It achieves this goal by solving the following linear system (with respect to $\{\alpha^{(t)}\}$)

$$\begin{aligned}
\left(\sum_{s=1}^{T}\alpha^{(s)}\triangledown{\mathcal{L}}^{(s)}(\boldsymbol{W})\right)^{T}\frac{\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})}{\left\|\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})\right\|}&=\left(\sum_{s=1}^{T}\alpha^{(s)}\triangledown{\mathcal{L}}^{(s)}(\boldsymbol{W})\right)^{T}\frac{\triangledown{\mathcal{L}}^{(1)}(\boldsymbol{W})}{\left\|\triangledown{\mathcal{L}}^{(1)}(\boldsymbol{W})\right\|},\quad\text{for } t\in\{2,\cdots,T\},\\
\sum_{t=1}^{T}\alpha^{(t)}&=1.
\end{aligned}$$
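The linear system above has $T$ equations in the $T$ unknowns $\{\alpha^{(t)}\}$, so for a small number of tasks it can be assembled and solved directly. The sketch below does this with NumPy, assuming the raw task gradients are available as the rows of an array and are nonzero; the function name `imtl_weights` is hypothetical, and the loss-rescaling heuristic of IMTL is not included.

```python
import numpy as np


def imtl_weights(grads):
    """Solve the equal-projection linear system for alpha.

    grads has shape (T, n); row t is the raw gradient of task t (assumed nonzero).
    np.linalg.solve raises LinAlgError if the system is singular, e.g. when the
    gradients are linearly dependent.
    """
    G = np.asarray(grads, dtype=float)
    T = G.shape[0]
    U = G / np.linalg.norm(G, axis=1, keepdims=True)   # unit-norm gradients u_t
    A = np.empty((T, T))
    # Rows 0..T-2 encode (u_t - u_1)^T (sum_s alpha_s g_s) = 0 for t = 2..T.
    A[:-1] = (U[1:] - U[0]) @ G.T
    # The last row encodes the normalization sum_t alpha_t = 1.
    A[-1] = 1.0
    b = np.zeros(T)
    b[-1] = 1.0
    return np.linalg.solve(A, b)


# Two orthogonal gradients with different magnitudes.
task_grads = np.array([[1.0, 0.0], [0.0, 2.0]])
alpha = imtl_weights(task_grads)
aggregated = alpha @ task_grads    # sum_t alpha_t g_t
```

For this two-task example the larger gradient receives the smaller weight, so the aggregated gradient projects equally onto both unit-norm task gradients.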

Before solving for $\{\alpha^{(t)}\}_{t=1}^{T}$, IMTL also proposes a heuristic to rescale $\{{\mathcal{L}}^{(t)}(\boldsymbol{W})\}$ so that all losses are on similar scales, which is essentially another scaling of $\{\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})\}$. Achievement-based MTL (yun2023achievement) defines the weight of each task by measuring its training progress as $\alpha^{(t)}=(1-\text{acc}_{t}/(m\cdot\text{maxacc}_{t}))^{\gamma}$, where $\gamma>0$, $m>1$, and $\text{acc}_{t}$ and $\text{maxacc}_{t}$ are the current training accuracy of task $t$ (trained in the multi-task setting) and the maximum training accuracy of task $t$ (trained in the single-task setting), respectively.
Achievement-based MTL also uses the geometric mean instead of the arithmetic mean to define the loss function; namely, it solves $\min_{\boldsymbol{W}}\left(\prod_{t=1}^{T}({\mathcal{L}}^{(t)}(\boldsymbol{W}))^{\alpha^{(t)}}\right)^{1/T}$.
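To make the two formulas of Achievement-based MTL concrete, the sketch below computes the task weights from accuracies and then evaluates the weighted geometric-mean loss. The function names and the default values `margin=1.1` (playing the role of $m$) and `gamma=2.0` are illustrative assumptions, and the losses are assumed to be strictly positive so that the logarithm is defined.

```python
import numpy as np


def achievement_weights(acc, max_acc, margin=1.1, gamma=2.0):
    """alpha_t = (1 - acc_t / (m * maxacc_t))^gamma.

    Assumes acc_t <= margin * max_acc_t so that the base stays non-negative.
    """
    acc = np.asarray(acc, dtype=float)
    max_acc = np.asarray(max_acc, dtype=float)
    return (1.0 - acc / (margin * max_acc)) ** gamma


def weighted_geometric_mean_loss(losses, weights):
    """(prod_t L_t^{alpha_t})^{1/T}, evaluated in log space for stability."""
    losses = np.asarray(losses, dtype=float)     # assumed strictly positive
    weights = np.asarray(weights, dtype=float)
    num_tasks = len(losses)
    return float(np.exp(np.sum(weights * np.log(losses)) / num_tasks))


# The task that is further from its single-task accuracy gets the larger weight.
alpha = achievement_weights(acc=[0.5, 0.9], max_acc=[1.0, 1.0], margin=1.25)
total = weighted_geometric_mean_loss([0.8, 0.3], alpha)
```

Working in log space turns the product into a weighted sum of log-losses, which is also the form one would typically differentiate in an automatic differentiation framework.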

Uncertainty Weighting (kendall2018multi) takes a different perspective from the above dynamic weighting approaches. This work assumes that the labels of each task follow an underlying distribution and that different tasks are independent. The final loss function, derived from the likelihood perspective, takes the same form as Eq. (75), with $\{\alpha_{k}^{(t)}\}$ specified as the reciprocal of the variance of the distribution used to model each task and its loss function. Instead of optimizing only over the parameter $\boldsymbol{W}$, kendall2018multi optimizes $\boldsymbol{W}$ and $\{\alpha^{(t)}\}$ simultaneously

$$\min_{\boldsymbol{W},\{\alpha^{(t)}\}_{t=1}^{T}}{\mathcal{L}}_{\text{total}}(\boldsymbol{W})=\sum_{t=1}^{T}\alpha^{(t)}{\mathcal{L}}^{(t)}(\boldsymbol{W}). \tag{78}$$
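As a hedged illustration of how the weights in Eq. (78) can be learned, the sketch below assumes the Gaussian-likelihood form used for regression tasks in kendall2018multi, in which $\alpha^{(t)}=1/(2\sigma_{t}^{2})$ and the objective additionally carries a $\log\sigma_{t}$ term per task that keeps the weights from collapsing to zero. The task losses are held fixed here purely for illustration (in practice $\boldsymbol{W}$ is optimized jointly), and the function names, step size, and iteration count are hypothetical choices.

```python
import numpy as np


def uncertainty_weighted_loss(task_losses, log_vars):
    """Return the total loss, the weights alpha_t, and the gradient w.r.t. s_t.

    s_t = log(sigma_t^2) is the learnable log-variance of task t, so that
    alpha_t = 1 / (2 sigma_t^2) and log(sigma_t) = 0.5 * s_t.
    """
    losses = np.asarray(task_losses, dtype=float)
    s = np.asarray(log_vars, dtype=float)
    alpha = 0.5 * np.exp(-s)
    total = np.sum(alpha * losses + 0.5 * s)
    grad_s = -alpha * losses + 0.5            # derivative of total w.r.t. each s_t
    return float(total), alpha, grad_s


def fit_log_vars(task_losses, steps=2000, lr=0.1):
    """Plain gradient descent on the log-variances with the task losses frozen."""
    s = np.zeros(len(task_losses))
    for _ in range(steps):
        _, _, grad_s = uncertainty_weighted_loss(task_losses, s)
        s -= lr * grad_s
    return s


log_vars = fit_log_vars([4.0, 0.25])
_, alpha, _ = uncertainty_weighted_loss([4.0, 0.25], log_vars)
```

With the losses frozen, the stationarity condition gives $\sigma_{t}^{2}={\mathcal{L}}^{(t)}$, so tasks with larger losses receive smaller weights; this is the balancing effect that the likelihood derivation builds in.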

At this point, one can observe that all the aforementioned works under the dynamic weighting category, excluding kendall2018multi, do not necessarily respect the optimization problem formulation in Eq. (75), even though they empirically work well in producing useful solutions. Nonetheless, one can also regard the dynamic weighting approach as either solving Eq. (78) using different rule-based strategies to update $\{\alpha^{(t)}\}_{t=1}^{T}$ or using gradient-based methods to inexactly solve a sequence of problems in the form of Eq. (75).

To conclude this section, we point out that some works try to address both issues (gradient dominance and gradient conflict) simultaneously (javaloy2022rotograd; Senushkin_2023_CVPR). For example, Alignment for MTL (Aligned-MTL) (Senushkin_2023_CVPR) considers the condition number of the linear system $\boldsymbol{d}=\boldsymbol{G}\boldsymbol{\alpha}$ as a measure of the severity of both gradient dominance and gradient conflict, where $\boldsymbol{G}=[-\triangledown{\mathcal{L}}^{(1)},\ldots,-\triangledown{\mathcal{L}}^{(T)}]$ and $\boldsymbol{\alpha}=(\alpha^{(1)},\ldots,\alpha^{(T)})^{T}$. The authors therefore propose to find a well-conditioned $\hat{\boldsymbol{G}}$ that approximates $\boldsymbol{G}$ and thereby obtain a refined update direction $\hat{\boldsymbol{d}}$. Concretely, they propose to solve $\min_{\hat{\boldsymbol{G}}}\left\|\hat{\boldsymbol{G}}-\boldsymbol{G}\right\|\quad\text{s.t.}\quad\hat{\boldsymbol{G}}^{T}\hat{\boldsymbol{G}}=I$ by singular value decomposition (SVD) and use the refined direction $\hat{\boldsymbol{d}}=\hat{\boldsymbol{G}}\boldsymbol{\alpha}$ instead of $\boldsymbol{d}=\boldsymbol{G}\boldsymbol{\alpha}$. The convergence rate of the proposed algorithm is established under the assumption that all loss functions are Lipschitz smooth and bounded below. Although the numerical results are promising, one should be aware of the computational cost of the SVD despite the existence of efficient algorithms (bondhugula2006fast).
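The constrained problem stated above is a nearest-orthonormal-matrix problem. Assuming the norm is the Frobenius norm and $\boldsymbol{G}$ has full column rank, its solution is obtained by replacing the singular values of $\boldsymbol{G}$ with ones, which the following sketch illustrates. The function names are hypothetical, and the sketch covers only the problem as written in this section, not any additional scaling that the original method may apply.

```python
import numpy as np


def nearest_orthonormal(G):
    """Closest matrix to G (Frobenius norm) with orthonormal columns.

    G has shape (n, T) with n >= T and is assumed to have full column rank.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)   # thin SVD: G = U diag(s) Vt
    return U @ Vt                                      # singular values replaced by ones


def refined_direction(G, alpha):
    """Refined update direction d_hat = G_hat @ alpha."""
    return nearest_orthonormal(G) @ np.asarray(alpha, dtype=float)


# One dominant column (norm 3) and one tiny column (norm 0.1): badly conditioned.
G = np.array([[3.0, 0.0], [0.0, 0.1], [0.0, 0.0]])
G_hat = nearest_orthonormal(G)
d_hat = refined_direction(G, [0.5, 0.5])
```

After the projection every column has unit norm and the columns are mutually orthogonal, so the condition number of $\hat{\boldsymbol{G}}$ equals one regardless of how unbalanced the original gradients were.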

Remarks. (i) The scalarization approach is characterized by its simplicity, as it transforms a multi-objective problem into a single-objective one. Hence, it is easy to implement, and many off-the-shelf optimizers can be applied. (ii) Generally, the scalarization approach has computational efficiency advantages over multi-objective optimization approaches, as will be discussed in the next section. (iii) The solution found by the scalarization approach might lack diversity, as it could be biased toward a solution that depends on the prescribed weights [Footnote 8: We characterize the diversity through the Pareto front, which will be discussed in the next section.]. Also, it is hard to conduct convergence analysis, especially for the dynamic weighting approach, since it attempts to solve a sequence of problems inexactly.

2.2.6. Multi-objective Optimization (MOO).

In contrast to the scalarization approach, which converts different objective functions $\{{\mathcal{L}}^{(1)},\ldots,{\mathcal{L}}^{(T)}\}$ into one aggregated objective function ${\mathcal{L}}_{\text{total}}$ and then optimizes it, MOO aims to simultaneously optimize several (potentially conflicting) objective functions. Concretely, MOO aims to solve the following problem

$$\min_{\boldsymbol{W}}{\mathcal{L}}(\boldsymbol{W})=({\mathcal{L}}^{(1)}(\boldsymbol{W}),\ldots,{\mathcal{L}}^{(T)}(\boldsymbol{W}))^{T}\quad\text{s.t.}\quad\boldsymbol{W}\in{\mathcal{F}}, \tag{79}$$

where ${\mathcal{F}}$ is the feasible domain for $\boldsymbol{W}$ (examples will be given shortly). For a comprehensive background on MOO, we refer readers to ehrgott2005multicriteria; for readers who prefer a quick overview of the subject, we recommend liu2020review. Below, we provide only the minimum background required to make the exposition accessible to readers with a background in single-objective optimization.

We begin with a few concepts that help readers understand the type of solutions that MOO algorithms can normally obtain.

Definition 4.
  (1) $\boldsymbol{W}^{*}$ is called a weak Pareto minimizer of ${\mathcal{L}}$ over ${\mathcal{F}}$ if there is no $\boldsymbol{W}\in{\mathcal{F}}$ such that ${\mathcal{L}}(\boldsymbol{W})<{\mathcal{L}}(\boldsymbol{W}^{*})$. Here, $<$ is the element-wise comparison. The set $PF({\mathcal{L}})=\{{\mathcal{L}}(\boldsymbol{W}^{*})~|~\boldsymbol{W}^{*}\text{ is a weak Pareto minimizer}\}$ is called the Pareto front.

  (2) $\boldsymbol{W}^{*}$ is called a strict Pareto minimizer of ${\mathcal{L}}$ over ${\mathcal{F}}$ if there is no $\boldsymbol{W}\in{\mathcal{F}}$ such that ${\mathcal{L}}(\boldsymbol{W})\leq{\mathcal{L}}(\boldsymbol{W}^{*})$ and $\boldsymbol{W}\neq\boldsymbol{W}^{*}$. Here, $\leq$ is the element-wise comparison.

  (3) $\boldsymbol{W}^{*}$ is called a Pareto stationary point of ${\mathcal{L}}$ over ${\mathcal{F}}$ if $\max_{t=1,\cdots,T}(\boldsymbol{W}-\boldsymbol{W}^{*})^{T}\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W}^{*})\geq 0$ for all $\boldsymbol{W}\in{\mathcal{F}}$. Intuitively, this definition implies that no feasible direction $\boldsymbol{d}:=\boldsymbol{W}-\boldsymbol{W}^{*}$ decreases all objective functions simultaneously (to first order): along every such direction, at least one objective function does not decrease.
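The first two notions in Definition 4 can be checked by brute force when the feasible set is replaced by a finite list of candidate parameters. The sketch below assumes that each row of the input is the loss vector ${\mathcal{L}}(\boldsymbol{W})$ of a distinct candidate $\boldsymbol{W}$, and it flags the candidates that are weak or strict Pareto minimizers within that list; the function name `pareto_flags` is hypothetical, and the result says nothing about points outside the candidate list.

```python
import numpy as np


def pareto_flags(loss_vectors):
    """Return boolean masks (weak, strict) over a finite candidate set.

    loss_vectors has shape (num_candidates, T); row j is L(W_j) and the
    candidates W_j are assumed to be pairwise distinct.
    """
    L = np.asarray(loss_vectors, dtype=float)
    num = L.shape[0]
    weak = np.ones(num, dtype=bool)
    strict = np.ones(num, dtype=bool)
    for j in range(num):
        for k in range(num):
            if k == j:
                continue
            if np.all(L[k] < L[j]):      # some candidate is better on every task
                weak[j] = False
            if np.all(L[k] <= L[j]):     # some other candidate is at least as good
                strict[j] = False
    return weak, strict


candidates = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0], [3.0, 3.0], [1.0, 4.0]])
weak_mask, strict_mask = pareto_flags(candidates)
```

In this toy set, the candidate with losses (1, 4) is a weak but not a strict Pareto minimizer, because (1, 3) is at least as good on both tasks yet not strictly better on the first.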

We give a graphical illustration of all these Pareto-related points in Fig. 15. In Fig. 15, the $\boldsymbol{W}^{*}$s that correspond to circles and crosses are Pareto stationary points. However, when $\{{\mathcal{L}}^{(t)}(\boldsymbol{W})\}_{t=1}^{T}$ are not convex, Pareto stationary points can generate $\{{\mathcal{L}}^{(t)}(\boldsymbol{W}^{*})\}$ that do NOT lie on the Pareto front. An analogy for this phenomenon in single-objective optimization is that a stationary point of a nonconvex objective function may not be the global minimum. Due to the nonconvex nature of neural networks, most, if not all, of the algorithms considered here (when a convergence analysis is provided) can only guarantee finding a Pareto stationary point rather than a weak/strict Pareto minimizer. However, if additional assumptions such as (strong) convexity are made, then one can obtain solutions whose objective values are on the Pareto front. In the sequel, we review works that use different strategies to generate a (set of) Pareto stationary point(s).

Figure 15. An illustration of weak and strict Pareto minimizers and the Pareto front. We emphasize that the circles and crosses on the curve are NOT themselves weak and strict Pareto minimizers. Instead, the $\boldsymbol{W}^{*}$s that generate the circles and crosses are weak and strict Pareto minimizers, respectively. In this figure, all circles and crosses correspond to Pareto stationary points. We remark that in this example, the Pareto front is convex and continuous. The Pareto front can also be non-convex and/or discontinuous; see, for example, liu2020review.

The first line of works, e.g., sener2018multi; lin2019pareto; navon2022multi, builds upon and extends the seminal Multiple-Gradient Descent Algorithm (MGDA) (fliege2000steepest) to the neural network setting. The essence of MGDA is to find, at each iteration, a common descent direction $\boldsymbol{d}$ that decreases all objective functions $\{{\mathcal{L}}^{(t)}\}$ simultaneously. If no such direction exists, the algorithm terminates and returns a (set of) Pareto stationary point(s). MGDA constructs the common descent direction $\boldsymbol{d}$ by solving the following optimization problem [Footnote 9: For simplicity, we now only consider the unconstrained case ${\mathcal{F}}=\mathbb{R}^{n}$; we will discuss the constrained case ${\mathcal{F}}\subset\mathbb{R}^{n}$ shortly.]

$$\max_{\boldsymbol{d}\in\mathbb{R}^{n}}\min_{t=1,\ldots,T}\left(-\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})\right)^{T}\boldsymbol{d}-\frac{1}{2}\left\|\boldsymbol{d}\right\|^{2}. \tag{80}$$

In problem (80), if we drop the quadratic term $\frac{1}{2}\left\|\boldsymbol{d}\right\|^{2}$, the problem intuitively seeks the search direction $\boldsymbol{d}$ that maximizes the minimal progress that can be made across tasks [Footnote 10: The progress is measured by the difference between ${\mathcal{L}}^{(t)}(\boldsymbol{W})$ and the first-order Taylor approximation of ${\mathcal{L}}^{(t)}$ at $\boldsymbol{W}+\boldsymbol{d}$.]. The quadratic term is included to guarantee the uniqueness of the solution of problem (80). The solution $\boldsymbol{d}$ is known as the steepest common descent direction in the optimization literature. In deep neural network applications, however, $n$ can be on the order of billions, so it is very challenging to solve problem (80) directly. Instead of solving (80), MGDA-MTL (sener2018multi) solves the dual problem

$$\min_{\beta\in\mathbb{R}^{T}}\frac{1}{2}\left\|\sum_{t=1}^{T}[\beta]_{t}\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})\right\|^{2}\quad\text{s.t.}\quad\sum_{t=1}^{T}[\beta]_{t}=1\text{ and }[\beta]_{t}\geq 0\text{ for all }t, \tag{81}$$

where $[\beta]_{t}$ is the $t$-th element of the vector $\beta$. One can see that the dual problem's dimension reduces to $T$, which is usually smaller than $n$ by several orders of magnitude, so the problem can be solved efficiently, e.g., by the Frank-Wolfe algorithm (jaggi2013revisiting), as is done in sener2018multi. The solution $\boldsymbol{d}^{*}$ to problem (80) can be recovered from the solution $\beta^{*}$ to problem (81) as $\boldsymbol{d}^{*}=-\sum_{t=1}^{T}[\beta^{*}]_{t}\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})$, and the model parameter is updated as $\boldsymbol{W}\leftarrow\boldsymbol{W}+\eta\boldsymbol{d}^{*}$ with $\eta\geq 0$. Under proper assumptions, the iterates or a subsequence of the iterates converge to a Pareto stationary point. If all $\{{\mathcal{L}}^{(t)}\}$ are convex, then the limit point is not only a Pareto stationary point but also a weak Pareto minimizer, meaning that its corresponding vector of function values is on the Pareto front.
MGDA-MTL further develops an efficient variant of MGDA for the case where the neural network's parameters can be decoupled as $\boldsymbol{W}=(\boldsymbol{W}^{\text{shared}},\boldsymbol{W}^{(1)},\ldots,\boldsymbol{W}^{(T)})$ and the common descent direction only needs to be found with respect to $\boldsymbol{W}^{\text{shared}}$. Another work, Nash-MTL (navon2022multi), formulates the problem of finding the common descent direction as a bargaining game. Concretely, the common descent direction $\boldsymbol{d}$ is obtained as $\boldsymbol{d}=G\beta$, where $G=[\triangledown{\mathcal{L}}^{(1)}(\boldsymbol{W}),\cdots,\triangledown{\mathcal{L}}^{(T)}(\boldsymbol{W})]$ and $\beta$ is a solution to the system of equations $G^{T}G\beta=1/\beta$ [Footnote 11: $1/\beta$ is the element-wise reciprocal.].
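To illustrate how the dual problem (81) yields the common descent direction of MGDA, the sketch below runs a generic Frank-Wolfe loop with exact line search over the simplex and then recovers $\boldsymbol{d}^{*}$ as the negative convex combination of the task gradients. It is a minimal sketch under the assumption that the task gradients are available as rows of a small array; the function name, iteration cap, and tolerance are hypothetical, and it is not claimed to match the exact solver used in sener2018multi.

```python
import numpy as np


def mgda_direction(grads, max_iter=100, tol=1e-10):
    """Frank-Wolfe on min_beta 0.5 * ||sum_t beta_t g_t||^2 over the simplex.

    grads has shape (T, n); row t is the gradient of task t.
    Returns the direction d* = -sum_t beta_t g_t and the weights beta.
    """
    G = np.asarray(grads, dtype=float)
    T = G.shape[0]
    M = G @ G.T                            # Gram matrix, M[s, t] = g_s^T g_t
    beta = np.full(T, 1.0 / T)             # start at the centre of the simplex
    for _ in range(max_iter):
        grad = M @ beta                    # gradient of 0.5 * beta^T M beta
        t_min = int(np.argmin(grad))       # vertex minimizing the linearized objective
        step_dir = -beta.copy()
        step_dir[t_min] += 1.0             # e_{t_min} - beta
        gap = -grad @ step_dir             # Frank-Wolfe duality gap (non-negative)
        if gap <= tol:
            break
        curvature = step_dir @ M @ step_dir
        if curvature <= 1e-18:             # flat along this direction: nothing to gain
            break
        gamma = min(1.0, gap / curvature)  # exact line search, clipped to [0, 1]
        beta = beta + gamma * step_dir
    d = -G.T @ beta
    return d, beta


# Two orthogonal task gradients: the min-norm point weights them equally.
d_star, beta_star = mgda_direction(np.array([[1.0, 0.0], [0.0, 1.0]]))
```

When the returned direction is (numerically) zero, no common descent direction exists and the current point is Pareto stationary; otherwise each task gradient has a negative inner product with the direction, so a sufficiently small step decreases all objectives to first order.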

One potential issue with MGDA-MTL and Nash-MTL, and more generally with MGDA-type methods, is that they can only produce one Pareto stationary point instead of a set of Pareto stationary points. Producing a set of solutions has the advantage of allowing practitioners to choose the solution that best fits their needs. To address this issue, Pareto-MTL (lin2019pareto) restricts the solution $\boldsymbol{W}^{*}$ produced by one run of MGDA to a certain domain such that $\{{\mathcal{L}}^{(t)}(\boldsymbol{W}^{*})\}$ lies in a restricted region of the Pareto front [Footnote 12: This is realizable only if the solution is a weak Pareto minimizer.]. By carefully crafting the regions, the algorithm can generate $K$ well-separated solutions on the Pareto front. Specifically, assuming ${\mathcal{L}}(\boldsymbol{W})\geq 0$ for all $\boldsymbol{W}$ and that a set of $K$ preference vectors $\{\boldsymbol{u}_{k}\}_{k=1}^{K}$ is given, Pareto-MTL solves $K$ problems in parallel, where the $k$-th problem is

$$\min_{\boldsymbol{W}}{\mathcal{L}}(\boldsymbol{W})=({\mathcal{L}}^{(1)}(\boldsymbol{W}),\ldots,{\mathcal{L}}^{(T)}(\boldsymbol{W}))^{T}\quad\text{s.t.}\quad\boldsymbol{u}_{i}^{T}{\mathcal{L}}(\boldsymbol{W})\leq\boldsymbol{u}_{k}^{T}{\mathcal{L}}(\boldsymbol{W})\text{ for all }i\in[K]\setminus\{k\}, \tag{82}$$

where $[K] = \{1, \ldots, K\}$. Intuitively, the constraints in Eq. (82) force the solution $\mathcal{L}(\boldsymbol{W})$ to stay close to $u_k$ in the angular space. Problem (82) is more challenging than problem (79) since it has $K-1$ nonlinear inequality constraints. Consequently, problem (80) is changed to account for these additional $K-1$ constraints. For more details, we refer readers to Eq. (14) in lin2019pareto. However, as pointed out in Exact Pareto Optimal Search (EPO search) (mahapatra2020multi), Pareto-MTL does not guarantee that the solution matches the exact preference, and $K$ needs to grow exponentially fast as $T$ increases. Therefore, EPO search re-designs the constraints and develops a new algorithm to search for the exact solution that matches the preference. Formally, EPO search proposes to solve

(83) $$\min_{\boldsymbol{W}} \mathcal{L}(\boldsymbol{W}) = (\mathcal{L}^{(1)}(\boldsymbol{W}), \ldots, \mathcal{L}^{(T)}(\boldsymbol{W}))^{T} \quad \text{s.t.} \quad \mathcal{L}^{(1)}(\boldsymbol{W})[u]_{1} = \cdots = \mathcal{L}^{(T)}(\boldsymbol{W})[u]_{T},$$

where $u \in \mathbb{R}^{T}$ is the user-specified preference vector, $[\cdot]_{i}$ takes the $i$th element, and $\mathcal{L}^{(t)}(\boldsymbol{W})$ is non-negative for all $t \in [T]$. Geometrically, this constraint requires the solution $\boldsymbol{W}^{*}$ to be such that the ray $(1/u_{1}, \cdots, 1/u_{T})$ intersects the Pareto front at $\mathcal{L}(\boldsymbol{W}^{*})$. Given an iterate $\boldsymbol{W}$, EPO search forms a search direction that tries to balance reducing the constraint violation (so that the new iterate "better" satisfies the constraint) and decreasing all objective functions.
Formally, the paper quantifies the constraint violation through the non-uniformity measure $\mu(\boldsymbol{W}) = \sum_{t=1}^{T} \hat{\mathcal{L}}^{(t)}(\boldsymbol{W}) \log\left(\frac{\hat{\mathcal{L}}^{(t)}(\boldsymbol{W})}{1/T}\right) = \text{KL}\left(\hat{\mathcal{L}}(\boldsymbol{W}) \,\|\, \frac{\mathbf{1}}{T}\right)$ with $\hat{\mathcal{L}}^{(t)}(\boldsymbol{W}) = \frac{[u]_{t}\mathcal{L}^{(t)}(\boldsymbol{W})}{\sum_{t'=1}^{T}[u]_{t'}\mathcal{L}^{(t')}(\boldsymbol{W})}$. One can easily check that $\mu(\boldsymbol{W}) = 0$ if and only if $\boldsymbol{W}$ satisfies the constraints. EPO search shows that taking a step along the direction $\boldsymbol{d}_{1} = \sum_{t=1}^{T} \triangledown\mathcal{L}^{(t)}(\boldsymbol{W}) [u]_{t} \left(\log\left(\hat{\mathcal{L}}^{(t)}(\boldsymbol{W})/(1/T)\right) - \mu(\boldsymbol{W})\right)$ can reduce the non-uniformity (constraint violation). Meanwhile, the common descent direction $\boldsymbol{d}_{2}$ that reduces all objective functions, if one exists, takes the form $G\beta$, where $G = [\triangledown\mathcal{L}^{(1)}(\boldsymbol{W}), \cdots, \triangledown\mathcal{L}^{(T)}(\boldsymbol{W})]$, $\beta_{t} \geq 0$ for all $t \in [T]$, and $\mathbf{1}^{T}\beta = 1$.
EPO search then designs a linear programming problem to find a search direction $\boldsymbol{d}$ that balances reducing the constraint violation and reducing the loss functions, guided by $(\boldsymbol{d}_{1}, \boldsymbol{d}_{2})$. For more details, please refer to mahapatra2020multi. Built upon EPO search, PHN (Pareto HyperNetworks) (navon2021learning) proposes to use a hypernetwork, which takes the preference vector $\boldsymbol{u}$ as input and outputs the neural network weights for the multi-task model, in an attempt to learn the whole Pareto front. Although the training is more challenging, if the hypernetwork is properly trained, then at inference time the user can supply any preference vector $\boldsymbol{u}$, and the hypernetwork outputs a Pareto stationary solution that closely aligns with $\boldsymbol{u}$ without requiring any additional effort.
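To make the quantities in the Pareto-MTL and EPO search formulations concrete, the following NumPy sketch evaluates them on toy loss vectors. It is an illustration only, not the authors' implementation: the function names are hypothetical, the losses are assumed strictly positive, and the coefficients of $\boldsymbol{d}_{1}$ are read as $a_{t} = [u]_{t}(\log(T\hat{\mathcal{L}}^{(t)}) - \mu)$ with $T$ tasks.

```python
import numpy as np

def pareto_mtl_region(losses, prefs):
    """Index k of the preference vector whose region contains `losses`.

    The constraints u_i^T L <= u_k^T L for all i != k in Eq. (82) hold
    exactly when k maximizes u_k^T L, so the region is an argmax.
    """
    return int(np.argmax(prefs @ losses))

def normalized_losses(losses, u):
    """L_hat^(t) = u_t L^(t) / sum_t' u_t' L^(t')."""
    weighted = u * losses
    return weighted / weighted.sum()

def non_uniformity(losses, u):
    """mu = KL(L_hat || 1/T); assumes every u_t * L^(t) is strictly positive."""
    l_hat = normalized_losses(losses, u)
    return float(np.sum(l_hat * np.log(l_hat * len(l_hat))))

def balance_coefficients(losses, u):
    """Coefficients a_t = u_t (log(T * L_hat^(t)) - mu) weighting the task gradients in d_1."""
    l_hat = normalized_losses(losses, u)
    return u * (np.log(l_hat * len(l_hat)) - non_uniformity(losses, u))

# Toy example with T = 2 tasks (all numbers are made up for illustration).
prefs = np.array([[1.0, 0.0], [0.0, 1.0]])   # K = 2 preference vectors
u = np.array([2.0, 1.0])                     # EPO preference vector
balanced = np.array([1.0, 2.0])              # u_t * L^(t) is equal for both tasks
skewed = np.array([3.0, 1.0])                # violates the equality constraint

print(pareto_mtl_region(np.array([0.9, 0.1]), prefs))
print(non_uniformity(balanced, u), non_uniformity(skewed, u))
print(balance_coefficients(skewed, u))
```

Under this reading, the coefficients equal the partial derivatives of $\mu$ with respect to the individual losses up to the positive factor $\sum_{t}[u]_{t}\mathcal{L}^{(t)}$, so a gradient-descent-style update that subtracts a small multiple of $\boldsymbol{d}_{1}$ lowers $\mu$ to first order.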

All aforementioned algorithms, regardless of their actual implementation, assume access to the true gradients $\{\triangledown\mathcal{L}^{(t)}(\boldsymbol{W})\}_{t=1}^{T}$. This assumption might fail in deep neural network settings. MoCo (multi-objective gradient correction) (fernando2023mitigating) is proposed to address this issue. It extends MGDA to the stochastic setting, providing convergence rates for both convex and non-convex cases. The most notable challenge in extending MGDA to the stochastic setting lies in the noise of the stochastic estimators of the true gradients $\{\triangledown\mathcal{L}^{(t)}(\boldsymbol{W})\}_{t=1}^{T}$. The standard way to address this issue is variance reduction. Unlike the seminal work of liu2021stochastic, which achieves variance reduction by increasing batch sizes, MoCo (fernando2023mitigating) reduces the variance with a momentum-based method, which has the advantage of keeping the batch size as small as one while still guaranteeing convergence (under proper assumptions). Concretely, at the $k$th iteration, instead of solving problem (81), MoCo solves

(84) $$\min_{\beta \in \mathbb{R}^{T}} \frac{1}{2}\left\|\sum_{t=1}^{T} [\beta]_{t} d_{k}^{(t)}\right\|^{2} \quad \text{s.t.} \quad \sum_{t=1}^{T} [\beta]_{t} = 1 \text{ and } [\beta]_{t} \geq 0 \text{ for all } t,$$

where $d_{k+1}^{(t)} = \textbf{Proj}_{L_{t}}\left[d_{k}^{(t)} - \zeta_{t}\left(d_{k}^{(t)} - \triangledown\tilde{\mathcal{L}}^{(t)}(\boldsymbol{W}_{k})\right)\right]$, $\textbf{Proj}_{L_{t}}$ projects a vector onto the ball centered at the origin with radius $L_{t}$, $L_{t}$ is the Lipschitz constant of $\mathcal{L}^{(t)}$, $\zeta_{t}$ is some positive constant, and $\triangledown\tilde{\mathcal{L}}^{(t)}(\boldsymbol{W}_{k})$ is some approximation of $\triangledown\mathcal{L}^{(t)}(\boldsymbol{W}_{k})$. One can show that $\left\|d_{k}^{(t)} - \triangledown\mathcal{L}^{(t)}(\boldsymbol{W}_{k})\right\| \to 0$ as $k \to \infty$, hence achieving the variance reduction.
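The tracking recursion and the subproblem in Eq. (84) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the MoCo reference code: the function names are hypothetical, the step size `zeta` is held constant, and the simplex-constrained problem is solved with a Frank-Wolfe loop with exact line search, which is one common choice for min-norm problems of this shape rather than necessarily the solver used in the paper.

```python
import numpy as np

def project_to_ball(v, radius):
    """Project v onto the ball of the given radius centered at the origin."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def moco_track(d, stochastic_grad, zeta, radius):
    """One tracking step d_{k+1} = Proj_L[d_k - zeta * (d_k - g_tilde)]."""
    return project_to_ball(d - zeta * (d - stochastic_grad), radius)

def min_norm_weights(directions, iters=100):
    """Frank-Wolfe solver for Eq. (84); `directions` has one row per task."""
    num_tasks = directions.shape[0]
    gram = directions @ directions.T            # Gram matrix of the d_k^(t)
    beta = np.full(num_tasks, 1.0 / num_tasks)  # start at the simplex center
    for _ in range(iters):
        grad = gram @ beta                      # gradient of 0.5 * beta^T gram beta
        vertex = np.zeros(num_tasks)
        vertex[np.argmin(grad)] = 1.0           # best simplex vertex
        step = vertex - beta
        curvature = step @ gram @ step
        if curvature <= 1e-12:                  # no further progress possible
            break
        gamma = np.clip(-(grad @ step) / curvature, 0.0, 1.0)  # exact line search
        beta = beta + gamma * step
    return beta

# Noise-free toy check: the tracker converges to a fixed gradient inside the ball.
true_grad = np.array([3.0, 4.0])
d = np.zeros(2)
for _ in range(60):
    d = moco_track(d, true_grad, zeta=0.5, radius=10.0)

# Weights for two toy tracked directions.
beta = min_norm_weights(np.array([[1.0, 0.0], [0.0, 2.0]]))
update_direction = beta @ np.array([[1.0, 0.0], [0.0, 2.0]])
```

With noisy gradients and a constant `zeta`, the tracker only averages out part of the noise; the convergence of $d_{k}^{(t)}$ to the true gradient stated above relies on the step-size conditions in fernando2023mitigating.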

To conclude this section, Table 7 summarizes, to the best of our knowledge, all existing optimization methods covered in § 2.2.5 and § 2.2.6.

Table 7. Algorithms for MTL as multi-objective optimization.

| Algorithm | Venue | Year (citation) | Method | Convergence | Highlight | Availability¹ |
|---|---|---|---|---|---|---|
| Uncertainty Weighting | CVPR | kendall2018multi | Dynamic Weighting | | Optimize $\{\alpha^{(t)}\}_{t=1}^{T}$ and $\boldsymbol{W}$ simultaneously. | Official |
| GradNorm | ICML | chen2018gradnorm | Dynamic Weighting | | Adjust $\{\alpha^{(t)}\}$ based on the average gradient norm of each task and the relative progress achieved for each task. | Unofficial |
| MGDA-MTL | NeurIPS | sener2018multi | Multi-Objective Opt. | Asymptotic convergence | Seminal work, which proposes to use MOO to solve deep MTL problems based on the multi-gradient descent algorithm. | Official |
| RMTL | Thesis | liu2018exploration | Dynamic Weighting | | Adjust $\{\alpha^{(t)}\}$ based on the relative progress achieved for each task. | Official |
| LBTW | AAAI | liu2019loss | Dynamic Weighting | | Adjust $\{\alpha^{(t)}\}$ using the reinforcement learning strategy. | Official |
| DWA | CVPR | liu2019end | Dynamic Weighting | | $\{\alpha^{(t)}\}$ is adapted to both samples and tasks. | Official |
| MLDT | CVPR | zheng2019pyramidal | Dynamic Weighting | | $\{\alpha^{(t)}\}$ is adapted to the likelihood of a loss reduction. | Official |
| Pareto MTL | NeurIPS | lin2019pareto | Multi-Objective Opt. | Asymptotic convergence | Attempt to incorporate the user's preference into the solution. | Official |
| Controllable Pareto MTL | arXiv | lin2020controllable | Multi-Objective Opt. | | Use a hypernetwork to learn the entire Pareto front. | Official |
| PCGrad | NeurIPS | yu2020gradient | Gradient Correction | | Project onto the orthogonal subspace to mitigate gradient conflicts. | Official |
| GradDrop | NeurIPS | chen2020just | Gradient Correction | | Only keep gradients that are consistent in sign in each update. | Official |
| Continuous Pareto MTL | ICML | ma2020efficient | Multi-Objective Opt. | | Construct a continuous, first-order approximation of the local Pareto set. | Official |
| EPO Search | ICML | mahapatra2020multi | Multi-Objective Opt. | | Find a Pareto stationary solution that exactly matches a user's preference. Requires losses to be non-negative. | Official |
| AuxiLearn | ICLR | navon2021auxiliary | Bi-level Opt. | | Learn to combine losses in a nonlinear fashion. | Official |
| IMTL | ICLR | liu2021towards | Gradient Correction | | Find $\{\alpha^{(t)}\}_{t=1}^{T}$ such that the aggregated gradient $\sum_{t=1}^{T}\alpha^{(t)}\triangledown\mathcal{L}^{(t)}(\boldsymbol{W})$ has equal projection onto the raw gradients of individual tasks. | Unofficial |
| GradVac | ICLR | wang2021gradient | Dynamic Weighting | | Encourage more geometrically aligned parameter updates for close tasks. | Unofficial |
| PHN | ICLR | navon2021learning | Multi-Objective Opt. | | Use a hypernetwork to learn the entire Pareto front. | Official |
| CAGrad | NeurIPS | liu2021conflictaverse | Gradient Correction | Asymptotic convergence | The search direction is found by solving a subproblem that is similar to MGDA. | Official |
| SVGD | NeurIPS | liu2021profiling | Multi-Objective Opt. | Convergence rate for strongly convex and third-order continuously differentiable functions | Integrate MGDA with Stein variational gradient descent and Langevin dynamics to obtain diverse solutions. | Official |
| COSMOS | ICDM | ruchte2021scalable | | | A single optimization run to approximate the full Pareto front by combining preference vectors sampled from a Dirichlet distribution with the training data. | Official |
| HV Maximization | arXiv | deist2021multi | | | Utilize hyper-volume to approximate the sample-level Pareto front. | Official |
| PNG | UAI | ye2022optimization | Multi-Objective Opt. | Convergence rate for convex losses | Minimize a preference loss over the Pareto front (manifold optimization) while only using first-order information. | |
| RLW & RGW | TMLR | lin2022reasonable | Dynamic Weighting | Converges to a neighborhood of the optimal solution under a strong convexity assumption | Sample the weights $\{\alpha^{(t)}\}_{t=1}^{T}$ from a given distribution at each step. | Unofficial |
| Nash-MTL | ICML | navon2022multi | Multi-Objective Opt. | Asymptotic convergence | Formulate the problem of finding a common descent direction as a bargaining game. | Official |
| (X)WC-MGDA | ICML | momma2022multi | Dynamic Weighting | | Lift the non-negativity requirement on losses in EPO search. | |
| Rotograd | ICLR | javaloy2022rotograd | Dynamic Weighting + Gradient Correction | | Dynamic weighting via gradient norm; gradient correction via rotating the feature space. | Official |
| MoCo | ICLR | fernando2023mitigating | Multi-Objective Opt. | Convergence rates for convex and nonconvex losses | Stochastic gradient and variance reduction. | Official |
| Recon | ICLR | shi2023recon | Gradient Correction | | Turn the shared parameters that are most likely to cause gradient conflicts into task-specific parameters. | Official |
| Aligned-MTL | CVPR | Senushkin_2023_CVPR | Gradient Correction | | Use the condition number of the linear system to measure the severity of gradient dominance and conflict issues. | Official |
| Achievement-based MTL | ICCV | yun2023achievement | Dynamic Weighting | | Use training progress to dynamically weight tasks and use the geometric mean to average the losses across tasks. | Official |
| FULLER | ICCV | huang2023fuller | Dynamic Weighting | | Use the gradient norms of different tasks to adjust the task weights. | |
Remarks. (i) The MOO approach, though it generally requires extra effort to find the common descent directions, provides a solid framework for convergence analysis. (ii) The MOO approach helps explore more diversified solutions over the Pareto front, whereas the scalarization approach cannot find solutions on the concave part of the Pareto front. Obtaining diversified solutions helps users understand the trade-offs among a set of objectives. (iii) The MOO approach gives the flexibility to incorporate user preferences into the solutions and does not require laborious tuning of task weights.
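Remark (ii) can be checked numerically on a toy problem. The sketch below is an illustration with assumed data: it takes the quarter circle $f_{1}^{2} + f_{2}^{2} = 1$ with $f_{1}, f_{2} \geq 0$ as a concave Pareto front for two minimization objectives and records which front point a weighted-sum scalarization selects for a range of weights. The names `front` and `scalarization_choice` are hypothetical.

```python
import numpy as np

# Sample a concave Pareto front for two minimization objectives:
# the quarter circle f1^2 + f2^2 = 1 in the non-negative quadrant.
theta = np.linspace(0.0, np.pi / 2, 101)
front = np.stack([np.cos(theta), np.sin(theta)], axis=1)

def scalarization_choice(front, weight):
    """Index of the front point that minimizes the weighted sum of objectives."""
    return int(np.argmin(front @ weight))

# Sweep strictly positive weights (w, 1 - w) and collect the selected points.
weights = [np.array([w, 1.0 - w]) for w in np.linspace(0.05, 0.95, 19)]
chosen = {scalarization_choice(front, w) for w in weights}
print(sorted(chosen))
```

On this front the weighted sum is a concave function of the angle, so its minimum sits at an end of the arc: every weight picks one of the two extreme points and none of the interior trade-offs, which is the limitation that preference-aware MOO methods are designed to avoid.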

2.2.7. Adversarial training

In the era of DL, joint task modeling has shown promising success by employing feature propagation or task balancing. However, it is important to acknowledge that task-specific features do not consistently result in mutual benefits, and learning multiple loosely connected tasks simultaneously introduces irrelevant noise. While task balancing helps alleviate negative transfer, it neglects the information exchange between tasks, often leading to suboptimal solutions. To address this issue, adversarial training (adhikarla2022memory), as an optimization approach, can effectively disentangle the space between task-shared and task-specific features by inherently preventing feature interference. This approach introduces a task discriminator, which distinguishes features or gradients learned from different tasks. The discriminator is trained along with a shared feature extractor to converge to a saddle point where the discriminator is unable to differentiate features or gradients learned from different tasks. Research in this field can be categorized into two main approaches based on the type of information utilized for adversarial training: representation-based and gradient-based. ASP-MTL (aka AdvMTL) (liu2017adversarial) first proposes an adversarial MTL framework that learns task-shared and task-specific features independently and introduces adversarial training to make the shared features invariant to the involved tasks. MTA$_{(\text{adv})}$N (liu2018multi) presents an adversarial MTL framework for image generation, where the multiple factors involved in image generation are treated as tasks and disentangled adversarially while a shared encoder is trained. RD4MTL (meng2019representation) employs adversarial training to encourage the features from different tasks to be disentangled and the features of irrelevant tasks to be minimally informative.
GREAT4MTL (sinha2018gradient) and AAMTRL (mao2020adaptive) utilize the gradients derived from different tasks and disentangle the space using the gradient reversal procedure (ganin2016domain).

Representation-Based. Adversarial Shared-Private Multi-Task Learning (ASP-MTL, aka AdvMTL) (liu2017adversarial) first proposes an adversarial MTL framework to alleviate the interference between the shared and task-specific feature spaces of the involved tasks. The underlying observation is that the same word may indicate different sentiments in different tasks, e.g., "infantile" in the product review "The infantile cart is simple and easy to use." and in the movie review "This kind of humor is infantile and boring.". Here "infantile" is a potentially misleading word when encoded in the shared feature space, as it expresses a neutral attitude in the product review while it conveys a negative attitude in the movie review. ASP-MTL addresses this issue by dividing the feature space into a shared space and a specific (private) space in a parallel manner, as shown in Fig. 7(t), and disentangles them using orthogonality constraints and adversarial losses. Let $\mathcal{X}^{(t)}_{s}$ and $\mathcal{X}^{(t)}_{p}$ denote the representations of the shared and private layers for the $t$-th task, respectively. The adversarial training process alternates between the shared feature generator $G$ (parametrized by $\boldsymbol{W}_{s}$) and the task discriminator $D$ (parametrized by $\boldsymbol{W}_{d}$) through a minimax optimization:

(85) $$\mathcal{L}_{adv} = \min_{\boldsymbol{W}_{s}} \max_{\boldsymbol{W}_{d}} -\mathcal{L}_{CE}\left[D_{\boldsymbol{W}_{d}}(\mathcal{X}^{(t)}_{s}), \boldsymbol{t}^{(t)}\right], \quad t = 1, \cdots, T,$$

where $\boldsymbol{t}^{(t)}$ denotes the ground-truth label indicating the task type, and $\mathcal{L}_{CE}$ denotes the cross-entropy loss used in practice. To further extract task-invariant features from the shared layers, ASP-MTL introduces the following orthogonality constraint to disentangle the shared and private feature spaces:

(86) $$\mathcal{L}_{orth}^{(t)} = \left\|\text{vec}(\mathcal{X}^{(t)}_{s})^{\top} \text{vec}(\mathcal{X}^{(t)}_{p})\right\|_{F}^{2}, \quad t = 1, \cdots, T,$$

where we abuse the vectorization $\text{vec}(\cdot)$ so that it preserves the sample dimension of the output feature tensors. The final learning objective consists of three components:

(87) $$\mathcal{L}_{total} = \sum_{t=1}^{T} \left(\mathcal{L}_{spec}^{(t)} + \lambda\mathcal{L}_{adv}^{(t)} + \gamma\mathcal{L}_{orth}^{(t)}\right),$$

where $\mathcal{L}_{spec}^{(t)}$ is the task-specific objective for the $t$-th task, and $\lambda$ and $\gamma$ are hyper-parameters that balance the learning terms. This total objective is trained with backpropagation with the help of a gradient reversal layer (GRL) (ganin2015unsupervised).
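The three loss terms in Eqs. (85) to (87) can be evaluated with a short NumPy sketch. It is a hedged illustration, not the ASP-MTL code: all function names are hypothetical, features are assumed to be arrays with the sample dimension on axis 0, the adversarial term is taken from the shared encoder's side (the negative task-classification cross-entropy), and only loss values are computed, so the gradient reversal layer itself is not shown.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer `labels` under softmax(`logits`)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def adversarial_term(disc_logits, task_ids):
    """Encoder-side value in Eq. (85): the negative task-classification loss."""
    return -cross_entropy(disc_logits, task_ids)

def orthogonality_loss(x_shared, x_private):
    """Eq. (86): squared Frobenius norm of X_s^T X_p, samples kept on axis 0."""
    xs = x_shared.reshape(len(x_shared), -1)
    xp = x_private.reshape(len(x_private), -1)
    return float(np.sum((xs.T @ xp) ** 2))

def total_objective(spec, adv, orth, lam, gamma):
    """Eq. (87): sum over tasks of spec + lam * adv + gamma * orth."""
    return float(sum(s + lam * a + gamma * o for s, a, o in zip(spec, adv, orth)))

# Toy batch: 6 samples, 4 shared and 3 private features, T = 2 tasks.
rng = np.random.default_rng(0)
x_shared = rng.normal(size=(6, 4))
x_private = rng.normal(size=(6, 3))
disc_logits = rng.normal(size=(6, 2))     # task discriminator outputs
task_ids = np.array([0, 0, 0, 1, 1, 1])   # which task each sample came from

adv = adversarial_term(disc_logits, task_ids)
orth = orthogonality_loss(x_shared, x_private)
print(adv, orth)
```

A discriminator that cannot tell the tasks apart outputs uniform probabilities, in which case the cross-entropy equals $\log T$; this is the saddle-point behavior the adversarial term aims for.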

Multi-Task Adversarial Network (MTA$_{(\text{adv})}$N) (liu2018multi) targets the problem of multiple factors existing in image generation. The architecture of MTA$_{(\text{adv})}$N is shown in Fig. 7(u), where the shared encoder $E$ extracts features that are disentangled across style factors for use in content classification (discriminator $D_{C}$) and generation (generator $G$). Let the original image and the corresponding content label be represented by $\mathcal{X}$ and $\boldsymbol{y}$, respectively. Then training the generation task entails updating the shared feature extractor $E$ and the generator $G$:

(88) $$\mathcal{L}_{G} = \min_{E,G} \sum_{(i,j):\, \boldsymbol{y}_{j} = \boldsymbol{y}_{i},\, \boldsymbol{z}_{j} = \boldsymbol{z}_{i}'} \left\|G(E(\mathcal{X}_{i}), \boldsymbol{z}_{i}') - \mathcal{X}_{j}\right\|_{2}^{2},$$

where $i$ and $j$ are data indices, and $\boldsymbol{z}'$ is sampled from the style label codebook $\mathcal{Z}$. Eq. (88) means that the generator $G$ tries to reconstruct the data $\mathcal{X}_{i}$ itself if $\boldsymbol{z}'_{i} = \boldsymbol{z}_{i}$, and otherwise tries to minimize the distance between the style-transferred $\mathcal{X}_{i}$ and any sample $\mathcal{X}_{j}$ that has the same content label and the target style label (i.e., $\boldsymbol{y}_{j} = \boldsymbol{y}_{i}, \boldsymbol{z}_{j} = \boldsymbol{z}_{i}'$).
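The pairing rule in Eq. (88) can be illustrated with a toy example. The sketch below makes strong simplifying assumptions that are not part of liu2018multi: samples are scalars equal to `10 * content + style`, the encoder and generator are hand-written functions rather than networks, and the name `generation_loss` is hypothetical. It only shows which pairs $(i, j)$ enter the sum and how style leakage in the encoder raises the loss.

```python
import numpy as np

def generation_loss(x, y, z, z_target, encode, generate):
    """Sum of ||G(E(x_i), z'_i) - x_j||^2 over pairs with y_j == y_i and z_j == z'_i."""
    total = 0.0
    for i in range(len(x)):
        output = generate(encode(x[i]), z_target[i])
        for j in range(len(x)):
            if y[j] == y[i] and z[j] == z_target[i]:
                total += float(np.sum((output - x[j]) ** 2))
    return total

# Toy data: each sample is 10 * content + style.
y = np.array([0, 0, 1, 1])                  # content labels
z = np.array([0, 1, 0, 1])                  # style labels
x = (10.0 * y + z).reshape(-1, 1)
z_target = np.array([1, 1, 0, 1])           # sampled target styles z'

def style_free_encode(v):
    """Keeps only the content part of a sample (style is stripped)."""
    return 10.0 * np.floor(v / 10.0)

def leaky_encode(v):
    """Passes the sample through unchanged, so the style leaks into the features."""
    return v

def generate(features, style):
    """Adds the requested style onto the encoded content."""
    return features + style

clean = generation_loss(x, y, z, z_target, style_free_encode, generate)
leaky = generation_loss(x, y, z, z_target, leaky_encode, generate)
print(clean, leaky)
```

In this toy setup the style-free encoder reproduces every target exactly, while the leaky encoder is penalized whenever the source style differs from zero, which mirrors why the shared encoder is pushed toward style-invariant features.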

The key adversarial training of style labels is defined using Earth Mover’s Distance (EMD) loss (arjovsky2017wasserstein) as follows:

(89) $\mathcal{L}_{S}=\min\nolimits_{E}\max\nolimits_{D_{S}}\sum\nolimits_{i}-\mathcal{L}_{EMD}(\boldsymbol{x}_{i},\boldsymbol{z}_{i})-\lambda\Omega_{GP}(D_{S}),$

where $\Omega_{GP}(D_{S})$ is a gradient penalty term (gulrajani2017improved) for the purpose of training stability and $\lambda$ serves as a trade-off hyper-parameter. To add the classification of the content factor, the total training objective is formulated as follows:

(90) $\mathcal{L}_{total}=\min\nolimits_{E,G}\mathcal{L}_{G}+\alpha\min\nolimits_{E}\max\nolimits_{D_{S}}\mathcal{L}_{S}+\beta\min\nolimits_{E,D_{C}}\mathcal{L}_{C},$

where $\mathcal{L}_{C}$ denotes the Cross-Entropy loss of the content classification task, and $\alpha$ and $\beta$ are trade-off hyper-parameters.

Representation Disentanglement for Multi-Task Learning (RD4MTL) (meng2019representation) aims to disentangle the indiscriminate mixing of properties in medical image analysis. As depicted in Fig. 7(v), an adversarial training process encourages the features from different tasks to be disentangled and minimally informative. Let $\boldsymbol{Z}^{(t)}$ denote the latent features extracted by the task-specific encoder $E_{\theta^{(t)}}$ from the original image $\mathcal{X}$. The $t$-th task-specific classification loss $\mathcal{L}_{cls}^{(t)}$ is then calculated as follows:

(91) $\boldsymbol{Z}^{(t)}=E_{\theta^{(t)}}(\mathcal{X}),\quad\mathcal{L}_{cls}^{(t)}=\mathcal{L}_{CE}(D_{\phi^{(t)}}(\boldsymbol{Z}^{(t)}),\boldsymbol{y}^{(t)}),\quad t=1,\cdots,T,$

where $\boldsymbol{y}^{(t)}$ is the ground truth label of the $t$-th task, and $\mathcal{L}_{CE}$ is the Cross-Entropy loss in practice. Furthermore, the adversarial regularization uses a minimax competition process as below:

(92) $\mathcal{L}_{adv}^{(t)}=\min\nolimits_{\{\theta^{(s)},\phi^{(s)}\}_{s\neq t}^{T}}\max\nolimits_{\psi^{(t)}}\sum\nolimits_{s\neq t}^{T}-\mathcal{L}_{CE}(D_{\psi^{(t)}}(\boldsymbol{Z}^{(s)}),\boldsymbol{y}^{(s)}),\quad t=1,\cdots,T,$

then the total training objective can be formulated as follows:

(93) $\mathcal{L}_{total}=\sum\nolimits_{t=1}^{T}(\mathcal{L}_{cls}^{(t)}+\lambda\mathcal{L}_{adv}^{(t)}),$

where $\lambda$ balances the two loss terms.

Adaptive Adversarial Multi-Task Representation Learning (AAMTRL) (mao2020adaptive) investigates the theoretical mechanism of adversarial MTL using Lagrangian duality, and further proposes a method that improves on the performance of classical adversarial MTL (aka AMTRL methods in (mao2020adaptive)). For simplicity, let the task-shared and task-private features for the $t$-th task be denoted by $\mathcal{X}_{s}^{(t)}$ and $\mathcal{X}_{p}^{(t)}$, in line with the formalization in Eq. (85). Assuming the shared feature extractor $E$ (parametrized by $\boldsymbol{W}_{s}$) and the task discriminator $D$ (parametrized by $\boldsymbol{W}_{d}$) to be Bayes-optimal, AAMTRL introduces the matrix $\boldsymbol{R}$ to measure the task relatedness, where

(94) $r_{i,j}=\frac{D_{j}(\mathcal{X}_{s}^{(i)})+D_{i}(\mathcal{X}_{s}^{(j)})}{D_{i}(\mathcal{X}_{s}^{(i)})+D_{j}(\mathcal{X}_{s}^{(j)})},$

where $r_{i,j}$ is the $(i,j)$-th entry of the matrix $\boldsymbol{R}$, and $D_{i}$ represents the probability that the discriminator $D$ classifies the input representations as the $i$-th task type. In AAMTRL, the adaptation is realized by the weighting strategy of task-specific objectives $\{\mathcal{L}^{(t)}_{spec}\}_{t=1}^{T}$:

(95) $\mathcal{L}_{spec}=\sum\nolimits_{t=1}^{T}\alpha_{t}\mathcal{L}^{(t)}_{spec},\quad\alpha_{t}=\boldsymbol{1}\boldsymbol{R}/(\boldsymbol{1}\boldsymbol{R}\boldsymbol{1}^{\top}).$
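As a rough illustration of Eqs. (94) and (95), the sketch below builds the relatedness matrix from discriminator probabilities and derives the task weights. It assumes a hypothetical table `P`, where `P[a][b]` stands for $D_{b}(\mathcal{X}_{s}^{(a)})$, and it reads $\boldsymbol{1}\boldsymbol{R}/(\boldsymbol{1}\boldsymbol{R}\boldsymbol{1}^{\top})$ as the column sums of $\boldsymbol{R}$ divided by the sum of all entries; the function names and toy numbers are illustrative only.

```python
def relatedness_matrix(P):
    """Eq. (94): r_ij = (P[i][j] + P[j][i]) / (P[i][i] + P[j][j]).

    P[a][b] is assumed to hold D_b(X_s^(a)), i.e. the probability that the
    discriminator assigns task b to the shared features of task a.
    """
    T = len(P)
    return [[(P[i][j] + P[j][i]) / (P[i][i] + P[j][j]) for j in range(T)]
            for i in range(T)]


def task_weights(R):
    """Eq. (95): column sums of R, normalized by the sum of all entries."""
    T = len(R)
    col_sums = [sum(R[i][t] for i in range(T)) for t in range(T)]
    total = sum(col_sums)
    return [c / total for c in col_sums]


# Toy discriminator outputs for T = 3 tasks (hypothetical values).
P = [[0.7, 0.2, 0.1],
     [0.3, 0.6, 0.1],
     [0.1, 0.1, 0.8]]
R = relatedness_matrix(P)
alpha = task_weights(R)
```

By construction the diagonal of `R` equals one, `R` is symmetric, and the weights in `alpha` sum to one, so tasks whose shared features are more often confused with those of other tasks receive larger weights.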

The classic adversarial MTRL problem can be regarded as the Lagrangian dual function of the following equality-constrained optimization problem:

(96) $\min\nolimits_{\{\boldsymbol{W}_{s},\boldsymbol{W}_{d}\}}\mathcal{L}_{spec},\quad\quad\text{s.t.}\quad\mathcal{L}_{adv}=0.$

To avoid the sub-optimal solution of the traditional Lagrangian duality in solving the problem above, an augmented Lagrangian with a quadratic form is proposed as follows:

(97) $\min\nolimits_{\{\boldsymbol{W}_{s},\boldsymbol{W}_{d}\}}\mathcal{L}_{spec}+\lambda\mathcal{L}_{adv}+\frac{r}{2}\mathcal{L}^{2}_{adv},$

where $\lambda$ is the Lagrangian multiplier, and $r$ is the penalty hyper-parameter that can balance the duality gap. By using Lagrangian duality, AAMTRL obtains an exact generalization error bound, which has been little investigated in the classic AMTRL.
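The objective in Eq. (97) can be written down directly once the two loss values are available. The following is a minimal sketch on scalar loss values; the multiplier update is the textbook method-of-multipliers step and is an assumption here, since the text above does not state how $\lambda$ is updated.

```python
def augmented_lagrangian(l_spec, l_adv, lam, r):
    """Eq. (97): L_spec + lambda * L_adv + (r / 2) * L_adv ** 2."""
    return l_spec + lam * l_adv + 0.5 * r * l_adv ** 2


def update_multiplier(lam, l_adv, r):
    """Assumed textbook update: lambda <- lambda + r * L_adv."""
    return lam + r * l_adv


# Toy scalar losses (hypothetical values).
objective = augmented_lagrangian(l_spec=1.0, l_adv=0.5, lam=2.0, r=4.0)
new_lam = update_multiplier(lam=2.0, l_adv=0.5, r=4.0)
```

When the constraint $\mathcal{L}_{adv}=0$ holds, both extra terms vanish and the objective reduces to $\mathcal{L}_{spec}$, matching the constrained problem in Eq. (96).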

Gradient-Based. GRadiEnt Adversarial Training for MTL (GREAT4MTL) (sinha2018gradient) is one application scenario of GRadiEnt Adversarial Training (GREAT), which tries to make the gradients indistinguishable across the tasks involved. As depicted in Fig. 7(w), the encoder $E_{\theta}$ extracts shared features for multiple tasks, and the decoders $\{D_{\phi^{(t)}}\}_{t=1}^{T}$ perform the $T$ tasks. Thus, the basic learning objectives for specific tasks are:

(98) $\mathcal{L}_{spec}^{(t)}=\min\nolimits_{\theta,\{\phi^{(t)}\}_{t=1}^{T}}\mathcal{L}^{(t)}(D_{\phi^{(t)}}(E_{\theta}(\mathcal{X}^{(t)})),\boldsymbol{y}^{(t)}),\quad t=1,\cdots,T,$

where $\{(\mathcal{X}^{(t)},\boldsymbol{y}^{(t)})\}_{t=1}^{T}$ is the total dataset containing $T$ tasks, and $\mathcal{L}^{(t)}$ depends on the task type. In GREAT4MTL, the Gradient-Alignment Layer (GAL) $G_{\psi}$ is placed after the shared encoder and before the task-specific decoders to perform task discrimination. Unlike representation-based methods that attend to the features, $G_{\psi}$ is trained using gradients from different tasks as inputs:

(99) $\mathcal{L}_{adv}=\min\nolimits_{\{\phi^{(t)}\}_{t=1}^{T}}\max\nolimits_{\psi}\sum\nolimits_{t=1}^{T}-\mathcal{L}_{CE}(G_{\psi}(\nabla_{E_{\theta}(\mathcal{X}^{(t)})}D_{\phi^{(t)}}\mathcal{L}_{spec}^{(t)},\boldsymbol{y}^{(t)}),\boldsymbol{t}^{(t)}),$

where $\boldsymbol{t}$ is the ground truth label to indicate the task type, and the Cross-Entropy loss $\mathcal{L}_{CE}$ is used to calculate the task classification error. Then the total training objective function is:

(100) $\mathcal{L}_{total}=\sum\nolimits_{t=1}^{T}\mathcal{L}_{spec}^{(t)}+\mathcal{L}_{adv}.$

The GRL is inserted before the GAL to streamline the minimax optimization process above. The trade-off hyper-parameter is eliminated in Eq. (100) by using different learning rates during the training process of $\mathcal{L}_{spec}$ and $\mathcal{L}_{adv}$.

ASTMT (maninis2019attentive) also employs the GREAT strategy to effectively disentangle the task-shared and task-specific features acquired from the shared backbone and single-tasking components, as illustrated in the right portion of Fig. 7(n). This shows that GREAT can be seamlessly integrated with other frameworks.

Remarks (i) Adversarial training effectively disentangles the feature space into shared and task-specific components, ensuring that shared features remain indistinguishable across multiple tasks, while specific features retain their distinctiveness. (ii) Adversarial training can sometimes lead to unstable training dynamics, especially if the adversarial and task-specific components are not well-balanced. This can manifest as oscillations in learning or difficulty in achieving convergence.

2.2.8. Mixture of Experts (MoE)

Deep neural architectures have been extensively utilized in real-world MTL problems. However, scaling high-capacity deep neural networks to multi-task settings remains challenging, although conceptually appealing. The MoE (jacobs1991adaptive) framework inherently incorporates multiple expert networks, each of which can be selected for learning different tasks. The modern MoE layer (eigen2013learning; shazeer2017) has turned the MoE module into a plug-and-play component that integrates into various architectures, including CNNs, RNNs, and Transformers. The MoE layer, as depicted in Fig. 15(a), generally comprises a set of $N$ expert networks $\{E_{n}\}_{n=1}^{N}$ and a gating network $G$, whose output depends on the input data $\mathcal{X}$. This gating network generates a sparse $N$-dimensional vector that selects the necessary expert networks to compute the final prediction as follows:

(101) $\tilde{\boldsymbol{y}}=\sum\nolimits_{n=1}^{N}G(\mathcal{X})_{n}E_{n}(\mathcal{X}),$

where $G(\mathcal{X})_{n}\in\{0,1\}$ is the $n$-th entry of the sparse vector generated by the gating network $G$, and $\tilde{\boldsymbol{y}}$ represents the final prediction. Beyond MoE for STL, Multi-gate Mixture-of-Experts (MMoE) (ma2018modeling) explicitly introduces a separate gate/router for each task ($\{G_{t}\}_{t=1}^{T}$), as shown in Fig. 15(b). The final prediction for the $t$-th task is calculated as

(102) $\boldsymbol{y}^{(t)}=\sum\nolimits_{n=1}^{N}G_{t}(\mathcal{X}^{(t)})_{n}E_{n}(\mathcal{X}^{(t)}),\quad t=1,\cdots,T,$

where $(\mathcal{X}^{(t)},\boldsymbol{y}^{(t)})$ represents the data sampled from the $t$-th task. This work has inspired the development and use of multi-router MoE for MTL, including DSelect-$k$, which selects the top-$k$ experts for each task; MT-TaG (gupta2022sparsely), which demonstrates the robustness of Multi-Router MoE to loosely related tasks; CmoIE (wang2022multi), which constructs more insightful experts instead of incompetent ones; Mod-Squad (chen2023mod), which specializes experts for specific tasks by measuring the mutual information (MI) between tasks and experts; and SummaReranker (ravaut2022summareranker), which re-ranks a set of summary candidates to select the best one. On the other hand, task-conditioned routing with a shared router/gate is another variant, in which task-dependent representations are fed into a single router that makes the expert selections for all tasks, as depicted in Fig. 15(c) for comparison. The shared-router MoE is discussed separately from the Multi-Router MoE in M3ViT (fan2022m3vit). Task-level MoE (ye2022eliciting) designs router architectures of varying complexity under shared-router settings, including MLP, LSTM, and Transformer. In both variants, task relationships are captured by the different patterns in which experts are mixed.

Figure 16. The taxonomy of (a) MoE into two categories: (b) Multi-Router MoE and (c) Single-Router MoE.

Multi-Router MoE. Multi-gate Mixture of Experts (MMoE) (ma2018modeling) replaces the shared layers in the hard parameter sharing architecture with multiple MoE layers and retains an individual router for each task, resembling the soft parameter sharing architecture. The computation of the prediction for the $t$-th task is shown in Eq. (102). Each router network of MMoE is the softmax of a linear transformation of the input data representation:

(103) $G_{t}(\boldsymbol{X}^{(t)})=\text{softmax}(\boldsymbol{W}_{t}\boldsymbol{X}^{(t)}),\quad t=1,\cdots,T,$

where $\boldsymbol{W}_{t}\in\mathbb{R}^{N\times D}$, and $N$ and $D$ are the number of experts and the number of features, respectively. In comparison to the soft parameter sharing architecture, MMoE adds only a router for each task, resulting in a smaller model and better scalability as the number of tasks increases. In addition, the conditional computation (bengio2013estimating; shazeer2017) of the MoE layer requires activating only a subset of the experts on a per-example basis. While shazeer2017 offers a top-$k$ gating function with tunable Gaussian noise, the discontinuities of this gate can lead to convergence issues under gradient-based optimization.
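To make Eqs. (102) and (103) concrete, the following is a minimal pure-Python sketch of an MMoE forward pass: the experts are shared, and each task mixes their outputs with its own softmax gate. The toy experts, the gate matrices, and the use of a single input vector for all tasks are illustrative assumptions, not the configuration of ma2018modeling.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def mmoe_forward(x, experts, gate_weights):
    """Return one output vector per task.

    experts: list of N callables mapping x to an output vector.
    gate_weights: list of T matrices W_t, each of shape N x D.
    """
    expert_outs = [expert(x) for expert in experts]  # shared by all tasks
    out_dim = len(expert_outs[0])
    outputs = []
    for W_t in gate_weights:
        gate = softmax(matvec(W_t, x))               # Eq. (103)
        y_t = [sum(g * out[d] for g, out in zip(gate, expert_outs))
               for d in range(out_dim)]              # Eq. (102)
        outputs.append(y_t)
    return outputs


# Two toy experts and two tasks (hypothetical values).
experts = [lambda x: [x[0] + x[1]], lambda x: [x[0] - x[1]]]
gate_weights = [
    [[0.0, 0.0], [0.0, 0.0]],    # task 1: uniform mixture of both experts
    [[10.0, 0.0], [0.0, 0.0]],   # task 2: strongly prefers expert 1
]
outputs = mmoe_forward([1.0, 2.0], experts, gate_weights)
```

Here the first task averages the two experts, while the second task relies almost entirely on the first expert, which shows how different gates induce different mixtures over the same shared experts.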

Differentiable Selection of top-$k$ experts (DSelect-$k$) (hazimeh2021dselect) addresses the non-smoothness of top-$k$ gating by proposing a continuously differentiable and sparse gate in the context of MMoE. A direct cardinality constraint ($\ell_{0}$ norm) on the output vector of the gate function is not amenable to SGD. To address this issue, a binary encoding scheme is introduced to realize top-$k$ selection via unconstrained minimization. Let $\boldsymbol{Z}\in\mathbb{R}^{k\times m}$ denote a matrix that selects the top-$k$ experts, whose $i$-th row $\boldsymbol{z}_{i}$ is an $m$-dimensional binary encoding of the index of a single expert, where $m=\log_{2}N$ and $N$ is the total number of experts. The gate output vector $\boldsymbol{q}$ is defined as follows:

(104) $\boldsymbol{q}_{\boldsymbol{\alpha},\boldsymbol{Z}}=\sum\nolimits_{i=1}^{k}\sigma(\boldsymbol{\alpha})_{i}r(\boldsymbol{z}_{i}),$

where $\boldsymbol{\alpha}\in\mathbb{R}^{k}$ is a learnable vector that controls the importance of the selected top-$k$ experts, and $r(\boldsymbol{z}_{i})\in\mathbb{R}^{N}$ is the single expert selector, which returns a one-hot encoding of the index of the selected expert. Note that $\|\boldsymbol{q}_{\boldsymbol{\alpha},\boldsymbol{Z}}\|_{0}\leq k$ and $\sum\nolimits_{i=1}^{N}(\boldsymbol{q}_{\boldsymbol{\alpha},\boldsymbol{Z}})_{i}=1$, so the gate output satisfies the cardinality and normalization properties without any explicit constraint. Furthermore, DSelect-$k$ uses an element-wise smoothing function $S:\mathbb{R}\rightarrow\mathbb{R}$ to relax every binary variable in $\boldsymbol{Z}$ to be continuous in the range $(-\infty,+\infty)$:

(105) $\tilde{\boldsymbol{q}}_{\boldsymbol{\alpha},\boldsymbol{Z}}\approx\boldsymbol{q}_{\boldsymbol{\alpha},S(\tilde{\boldsymbol{Z}})},\quad S(z)=\begin{cases}0,&\text{if}~z\leq-\gamma/2,\\(-2/\gamma^{3})z^{3}+(3/(2\gamma))z+1/2,&\text{if}~-\gamma/2\leq z\leq\gamma/2,\\1,&\text{if}~z\geq\gamma/2,\end{cases}$

where $\gamma$ is a hyper-parameter that controls the width of the fractional region. Eqs. (104) and (105) transform the top-$k$ selection to be unconstrained and first-order differentiable.
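The gate of Eqs. (104) and (105) can be sketched in a few lines of pure Python. Two points are assumptions rather than statements from the text above: $\sigma$ is taken to be the softmax (consistent with the gate entries summing to one), and the single expert selector $r$ is implemented as a product over the bits of each expert index, which is one construction that returns a one-hot vector for binary input. The sketch also assumes that $N$ is a power of two so that $m=\log_{2}N$ is an integer.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]


def smooth_step(z, gamma=1.0):
    """S(z) from Eq. (105): 0 below -gamma/2, 1 above gamma/2, cubic between."""
    if z <= -gamma / 2:
        return 0.0
    if z >= gamma / 2:
        return 1.0
    return (-2.0 / gamma ** 3) * z ** 3 + (3.0 / (2.0 * gamma)) * z + 0.5


def single_expert_selector(z, num_experts):
    """Assumed form of r(z): entry n multiplies z_j if bit j of n is set, else 1 - z_j."""
    selector = []
    for n in range(num_experts):
        p = 1.0
        for j, z_j in enumerate(z):
            p *= z_j if (n >> j) & 1 else (1.0 - z_j)
        selector.append(p)
    return selector


def dselect_k_gate(alpha, Z_tilde, num_experts, gamma=1.0):
    """Eq. (104) evaluated on the smoothed encodings S(Z_tilde)."""
    weights = softmax(alpha)
    q = [0.0] * num_experts
    for w_i, row in zip(weights, Z_tilde):
        z = [smooth_step(v, gamma) for v in row]
        r = single_expert_selector(z, num_experts)
        for n in range(num_experts):
            q[n] += w_i * r[n]
    return q


# k = 2 selectors over N = 4 experts (m = 2 bits); toy values.
q_binary = dselect_k_gate([0.0, 0.0], [[5.0, -5.0], [-5.0, 5.0]], num_experts=4)
q_fractional = dselect_k_gate([0.2, -0.1], [[0.1, -0.2], [0.3, 0.0]], num_experts=4)
```

In the first call both rows saturate to binary codes, so the gate places weight 0.5 on each of two experts and zero elsewhere; in the second call the codes are fractional and the gate is dense but still sums to one, which is the regime in which gradients flow through $S$.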

Multi-Task Task-aware Gating (MT-TaG) (gupta2022sparsely) designs a task-aware sparse gating function to route expert selection for each task. Task-conditioned information is incorporated into the routing mechanism by constraining each embedding to top-$1$ expert selection. Let $\boldsymbol{x}_{i}^{(t)}$ be the token/embedding representation at the $i$-th position of the input sequence for the $t$-th task. A linear mapping is first applied to obtain the routing logits:

(106) 𝒙~i(t)=𝑾(t)𝒙i(t),t=1,,T,formulae-sequencesuperscriptsubscript~𝒙𝑖𝑡superscript𝑾𝑡superscriptsubscript𝒙𝑖𝑡𝑡1𝑇\tilde{\boldsymbol{x}}_{i}^{(t)}=\boldsymbol{W}^{(t)}\boldsymbol{x}_{i}^{(t)},% t=1,\cdots,T,over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT = bold_italic_W start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT , italic_t = 1 , ⋯ , italic_T ,

then the only expert routing is as follows through a softmax process:

(107) h(𝒙i(t))=maxj(e𝒙~i(t)𝟏e𝒙~i(t))jEj(𝒙i(t)),t=1,,T,h(\boldsymbol{x}_{i}^{(t)})=\max\nolimits_{j}(\frac{e^{\tilde{\boldsymbol{x}}_% {i}^{(t)}}}{\boldsymbol{1}^{\top}e^{\tilde{\boldsymbol{x}}_{i}^{(t)}}})_{j}% \cdot E_{j}(\boldsymbol{x}_{i}^{(t)}),t=1,\cdots,T,italic_h ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) = roman_max start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( divide start_ARG italic_e start_POSTSUPERSCRIPT over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG start_ARG bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG ) start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ⋅ italic_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) , italic_t = 1 , ⋯ , italic_T ,

where hhitalic_h denotes the task-conditioned representation calculated by the selected experts. Noticeably, the task relationship is implicitly encompassed within the variable hhitalic_h, thereby remaining independent of the experts involved. SummaReranker (ravaut2022summareranker) targets only the abstractive summarization task but utilizes different metrics to measure it. The re-ranking on a set of summary candidates generated by MMoE can consistently promote the base model.
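The top-1 task-aware routing of Eqs. (106) and (107) can be sketched as follows. This is an illustration under stated assumptions: experts are modeled as plain Python callables, and the names `mt_tag_route`, `W_task`, and `experts` are hypothetical.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def mt_tag_route(x, W_task, experts):
    """Route one token x of a given task to its top-1 expert.

    W_task  : (N, d) task-specific routing matrix W^(t) (assumed shape)
    experts : list of N callables, each mapping a (d,) vector to an output
    """
    logits = W_task @ x          # Eq. (106): task-specific routing logits
    probs = softmax(logits)      # softmax over the N experts
    j = int(np.argmax(probs))    # index of the single selected expert
    # Eq. (107): scale the selected expert's output by its gate probability
    return probs[j] * experts[j](x), j
```

Because only `experts[j]` is evaluated, the cost per token does not grow with the number of experts, which is the usual motivation for sparse gating.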

However, the promise of MMoE has been validated in MTL with the explicit task relationship backups. Calibrated Mixture of Insightful Experts (CMoIE) (wang2022multi) investigates the negative transfer in MMoE caused by incompetent experts in certain applications. Specifically, a conflict resolution module between each pair of experts and the expert communication among the layers of different experts are introduced to promote the diversity and capacity of experts. Additionally, a mixture calibration structure employed in the routing networks encourages the experts to handle more tasks without losing their specialty. For any input data $\mathcal{X}$, the conflict resolution employs the Euclidean distance to measure the outputs from each pair of experts:

(108) $\boldsymbol{D}_{i,j}=\ell_{2}(E_{i}(\mathcal{X}),E_{j}(\mathcal{X})),\quad i,j=1,\cdots,N,$

where $N$ is the total number of experts, and $\boldsymbol{D}\in\mathbb{R}^{N\times N}$ denotes the distance matrix between each pair of experts. Based on the max-margin $t$-distribution, the corresponding conflict attention matrix for each pair of experts is calculated to highlight the excessively similar expert pairs:

(109) $\boldsymbol{A}_{i,j}=1/(1+\max(0,\boldsymbol{D}_{i,j}-R_{i})),\quad i,j=1,\cdots,N,$

where $R_{i}$ is the conflict radius of the expert $E_{i}$, defined as the upper quartile of $\{\boldsymbol{D}_{i,j}\}_{j=1}^{N}$. Furthermore, the conflict loss is proposed as follows:

(110) $\mathcal{L}_{conflict}=-\sum\nolimits_{i=1}^{N}\sum\nolimits_{j=1}^{N}(\boldsymbol{A}\odot\boldsymbol{D})_{i,j},$

where $\mathcal{L}_{conflict}$ is combined with the multi-task loss in an end-to-end training process. To capture implicit task relationships by constructing task-aware representations, the fusion matrix $\boldsymbol{F}\in\mathbb{R}^{N\times N}$ is defined using a multilinear map as follows:

(111) $\boldsymbol{F}_{l}^{(t)}=G^{(t)}(\mathcal{X}^{(t)})\otimes H_{l}(\mathcal{X}^{(t)}),\quad t=1,\cdots,T,$

where $G^{(t)}$ and $H_{l}$ are the routing network and another hidden-layer gating network before the $l$-th layer of the experts, respectively. Let the hidden representations at the $l$-th layer of $E_{n}$ be denoted by $\boldsymbol{z}_{l}^{n}, n=1,\cdots,N$, and stack them as $[\boldsymbol{z}_{l}^{1},\cdots,\boldsymbol{z}_{l}^{N}]^{\top}$ to form the hidden representation matrix $\boldsymbol{Z}_{l}$. Through the fusion process defined in Eq. (111), the input of the $(l+1)$-th layer of $\{E_{n}\}_{n=1}^{N}$ is diffused by:

(112) $\tilde{\boldsymbol{Z}}_{l+1}^{(t)}=\boldsymbol{F}_{l}^{(t)}\boldsymbol{Z}_{l}^{(t)}+\boldsymbol{Z}_{l}^{(t)},\quad t=1,\cdots,T,$

where the representation is tailored by the task-specific fusion matrix. The residual term ($+\boldsymbol{Z}_{l}^{(t)}$) above prevents the fusion process from erasing the individuality of experts. To further enhance the specialization and concentration of experts on specific tasks, the mixture calibration introduces a dynamic temperature $\tau^{(t)}$ to control the logits for each routing network:

(113) $G^{(t)}(\mathcal{X}^{(t)})=\text{softmax}(g^{(t)}(\mathcal{X}^{(t)})/\tau^{(t)}),\quad t=1,\cdots,T,$

where the temperature parameters are progressively decreased from $1$ during the training process.
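The three CMoIE components described above (conflict loss, expert fusion, and calibrated routing) can be sketched in NumPy as below. Several points are assumptions made only for illustration: each expert output is a single vector, the multilinear map in Eq. (111) is taken to be an outer product of two length-$N$ gate vectors, and all function names are hypothetical.

```python
import numpy as np

def conflict_loss(expert_outputs):
    """Eqs. (108)-(110). expert_outputs: (N, d) array, one row per expert."""
    diff = expert_outputs[:, None, :] - expert_outputs[None, :, :]
    D = np.linalg.norm(diff, axis=-1)                # Eq. (108): pairwise Euclidean distances
    R = np.quantile(D, 0.75, axis=1, keepdims=True)  # conflict radius: upper quartile of row i
    A = 1.0 / (1.0 + np.maximum(0.0, D - R))         # Eq. (109): conflict attention
    return -(A * D).sum(), D, A                      # Eq. (110): negative weighted distance

def calibrated_gate(logits, tau):
    """Eq. (113): temperature-scaled softmax over routing logits."""
    s = logits / tau
    e = np.exp(s - s.max())
    return e / e.sum()

def fuse(gate, hidden_gate, Z):
    """Eqs. (111)-(112) with the outer-product assumption.

    gate, hidden_gate : (N,) outputs of G^(t) and H_l
    Z                 : (N, d) stacked hidden representations of the N experts
    """
    F = np.outer(gate, hidden_gate)  # (N, N) fusion matrix
    return F @ Z + Z                 # residual term keeps each expert's own representation
```

Minimizing the conflict loss pushes expert outputs apart, with pairs inside the conflict radius receiving full attention weight, while lowering `tau` sharpens the routing distribution toward fewer experts.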

Mod-Squad (chen2023mod) also allows cooperation and specialization in the process of matching experts and tasks. To make the experts dependent on tasks, the mutual information between them is first measured as below:

(114) $\mathcal{I}(\mathcal{T};E)=\sum\nolimits_{t=1}^{T}\sum\nolimits_{n=1}^{N}P(\mathcal{T}_{t},E_{n})\log\frac{P(\mathcal{T}_{t},E_{n})}{P(\mathcal{T}_{t})P(E_{n})},$

where the joint probability is determined by the number of data samples within a task that are routed to the target expert. Then the total loss can be formulated as follows:

(115) $\mathcal{L}_{total}=\sum_{t=1}^{T}\lambda_{t}\mathcal{L}_{\mathcal{T}_{t}}-\gamma\sum\nolimits_{\forall\text{ MoE layers }l}\mathcal{I}(\mathcal{T};E_{l}),$

where $\lambda_{t}$ is the hyper-parameter weighting the $t$-th task-specific loss $\mathcal{L}_{\mathcal{T}_{t}}$, and $\gamma$ balances the multi-task loss term and the mutual information term.
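To make Eqs. (114) and (115) concrete, the sketch below estimates the task-expert mutual information from a table of routing counts and plugs it into the total loss. Using hard counts is an assumption for illustration only (a trainable version would need a differentiable estimate of the joint probability), and the function names are hypothetical.

```python
import numpy as np

def task_expert_mutual_information(counts, eps=1e-12):
    """Eq. (114). counts[t, n]: number of samples of task t routed to expert n."""
    P = counts / counts.sum()           # joint probability P(T_t, E_n)
    Pt = P.sum(axis=1, keepdims=True)   # marginal over experts: P(T_t)
    Pe = P.sum(axis=0, keepdims=True)   # marginal over tasks: P(E_n)
    ratio = P / (Pt * Pe + eps)
    # Cells with P = 0 contribute nothing to the sum
    return float(np.sum(np.where(P > 0, P * np.log(ratio + eps), 0.0)))

def mod_squad_total_loss(task_losses, lambdas, counts_per_layer, gamma):
    """Eq. (115): weighted task losses minus gamma times the summed per-layer MI."""
    mi = sum(task_expert_mutual_information(c) for c in counts_per_layer)
    return float(np.dot(lambdas, task_losses)) - gamma * mi
```

When every task uses its own expert the mutual information is maximal, and when all tasks spread uniformly over all experts it is zero, so subtracting the term rewards task-dependent expert usage.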

Shared-Router (Task-Conditioned) MoE. Task-Level MoE (ye2022eliciting) first uses a shared router that takes the task representation as input, which is selected from a look-up embedding table. Moreover, Task-Level MoE first investigates the combinations of different backbone (MLP, LSTM, and Transformer) and softmax (softmax, Gumbel-Softmax, and ST Gumbel-Softmax (jang2016categorical)) variations of routers. M3ViT (fan2022m3vit) customizes MoE into a ViT backbone and compares the multi-router MoE with the shared-router MoE. ViT-based MMoE can feature hardware memory efficiency, as demonstrated in Edge-MoE (sarkar2023edge).

To circumvent the limitations associated with a fixed single expert, AdaMV-MoE (chen2023adamv), the Adaptive Mixture of Experts framework for Multi-task Vision Recognition, automatically determines the number of sparsely activated experts based on input token embeddings. Task-specific router networks are employed to select the most relevant experts for individual tasks. This process can be mathematically expressed as:

(116) $G_{t}(\boldsymbol{x}^{(t)})=\sum_{n=1}^{N_{t}}\mathcal{R}_{t}(\boldsymbol{x})\cdot E_{n}(\boldsymbol{x}),\quad t=1,\cdots,T,$

where $\mathcal{R}_{t}$ is the router for the $t$-th task. It should be noted that the number of experts ($N_{t}$) engaged is not predefined. AdaMV-MoE incorporates an adaptive mechanism, specifically the Adaptive Expert Selection (AES) technique, to dynamically adjust this quantity based on the task-specific loss observed on validation datasets ($\mathcal{L}_{\text{val}}^{(t)}$). If $\mathcal{L}_{\text{val}}^{(t)}$ exhibits no signs of decline over several iterations, the number of experts ($N_{t}$) is increased by 1. In contrast, if it exceeds the best loss value observed so far, the number of experts is reduced. Ultimately, after numerous iterations, the number of experts stabilizes.
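One possible reading of the AES rule is sketched below. The text leaves the exact schedule open, so the `patience` window, the bounds `n_min`/`n_max`, and the choice to shrink only right after an unsuccessful growth step are all assumptions of this sketch rather than details of AdaMV-MoE.

```python
class AdaptiveExpertSelector:
    """Tracks one task's validation loss and adapts its number of experts N_t."""

    def __init__(self, n_experts=1, patience=3, n_min=1, n_max=8):
        self.n = n_experts
        self.patience = patience      # stalled evaluations tolerated before growing
        self.n_min, self.n_max = n_min, n_max
        self.best = float("inf")      # best validation loss seen so far
        self.stall = 0                # evaluations since the last improvement
        self.grew = False             # True right after N_t was increased

    def step(self, val_loss):
        """Update N_t from one new validation loss and return it."""
        if val_loss < self.best:
            # Loss is still declining: keep the current number of experts
            self.best, self.stall, self.grew = val_loss, 0, False
            return self.n
        self.stall += 1
        if self.grew and val_loss > self.best:
            # The added expert made things worse than the best loss: shrink
            self.n = max(self.n_min, self.n - 1)
            self.grew, self.stall = False, 0
        elif self.stall >= self.patience:
            # No decline for `patience` evaluations: add one expert
            self.n = min(self.n_max, self.n + 1)
            self.grew, self.stall = True, 0
        return self.n
```

A training loop would keep one selector per task and call `step` after each validation pass, then pass the returned value as $N_{t}$ to that task's router.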

Remarks (i) MoE seamlessly accommodates tasks with different backbones, making it well-suited for scenarios with diverse task requirements and complexities. (ii) MoE efficiently allocates resources by assigning different experts to different tasks, avoiding redundancy and optimizing computational resources. (iii) MoE exhibits remarkable scalability, rendering it highly suitable for large-scale industry applications on the ground.

2.2.9. Graph based

Graphs have been widely used in data mining and machine learning due to their unique representation of objects and their interactions. Graph neural networks (GNNs) (sperduti1997supervised; gori2005new; scarselli2008graph; wu2020comprehensive), which leverage nodes and the edges among connected nodes to conduct inference, have gained popularity for their impressive performance in capturing inter-node relations on graphs. It is natural to consider the tasks and corresponding data samples in MTL as nodes and their relations as edges to construct a graph for MTL (alon2017graph). By conducting graph mining on such graphs, relations among tasks or data samples in MTL can be better understood so as to assist the final MTL model in conducting inference (chen2019multi; cao2022relational; liu2020asymmetric; liu2022structured).

Figure 17. An example of MultiKernel predicting the probability of a data sample $t$ belonging to task $T_{2}$. MultiKernel conducts the prediction based on $T_{2}$ and its domains, whose hierarchical relation is extracted from the predefined tree. Specifically, ellipses are domains, and squares are tasks.

MultiKernel (widmer2010leveraging) conducts MTL over a series of classification tasks with predefined hierarchical relations, which is often the case for biological problems. Notably, it constructs a tree that reflects the hierarchical relations between tasks and domains, where leaf nodes are the tasks it studies (e.g., dog), whose parent and ancestors (non-leaf nodes) are the corresponding biological domains (e.g., mammals and animals).

For a queried sample $\boldsymbol{x}$, MultiKernel classifies it with every task $t$'s predictor $f_{t}$ by

(117) $f_{t}(\boldsymbol{x})=\big(\boldsymbol{u}_{t}+\sum\nolimits_{r\in\{t\text{'s ancestors}\}}\lambda_{t,r}\boldsymbol{u}_{r}\big)^{\top}\boldsymbol{x}+b_{t},$

where $\lambda_{t,r}$ is a pre-calculated constant inversely related to the distance between task $t$ and its ancestor $r$, $\boldsymbol{u}_{t}$ is the representation of task $t$, and $b_{t}$ is a learnable variable. The representations of nodes within the predefined tree are learned by minimizing the task error.
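Eq. (117) can be evaluated directly once the tree is stored as an ancestor list per task. In the sketch below the dictionary layout (`u`, `b`, `ancestors`, `lam`) and the function name are assumptions chosen for readability.

```python
import numpy as np

def multikernel_predict(x, task, u, b, ancestors, lam):
    """Eq. (117): f_t(x) = (u_t + sum_r lam[t, r] * u_r)^T x + b_t.

    u         : dict mapping each tree node (task or domain) to its weight vector
    b         : dict mapping each task to its bias
    ancestors : dict mapping each task to the list of its ancestor nodes
    lam       : dict mapping (task, ancestor) to the pre-calculated constant
    """
    w = u[task].copy()                 # start from the task's own representation
    for r in ancestors[task]:
        w += lam[(task, r)] * u[r]     # add each ancestor, weighted by closeness
    return float(w @ x + b[task])
```

For the dog/mammals/animals example in the text, `ancestors["dog"]` would be `["mammals", "animals"]`, with a larger `lam` for the closer ancestor.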

ML-GCN (chen2019multi) is a graph convolutional network (GCN)-based MTL model for capturing the label correlations in multi-label image recognition. Specifically, different from traditional MTL, ML-GCN pre-constructs a correlation matrix that reflects labels’ co-occurrence patterns within datasets. This matrix enables the system to build a label graph, where each node represents a label, and whose feature is the corresponding word embedding.

On retrieving the label graph, ML-GCN jointly trains a CNN and a GCN for the MTL. The CNN learns from image datasets to retrieve image representations, and the GCN learns from the label graph to generate label representations. ML-GCN obtains the multi-label prediction $\hat{\boldsymbol{y}}$ for an input image $\boldsymbol{x}$ by computing dot products between image representations and label representations as $\hat{\boldsymbol{y}}=\boldsymbol{W}\cdot f(\boldsymbol{x};\theta)$, where $f(\cdot)$ and $\theta$ are the CNN model and its parameters, respectively. $\boldsymbol{W}=\{\boldsymbol{w}^{(i)}\}_{i=0}^{C}$ is the set of label representations output by the GCN.
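The prediction rule $\hat{\boldsymbol{y}}=\boldsymbol{W}\cdot f(\boldsymbol{x};\theta)$ can be sketched with a two-layer GCN over the label graph. This sketch assumes the common symmetric normalization with self-loops, which may differ from the re-weighted correlation matrix used by ML-GCN, and it takes the CNN feature as a given vector; all names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (an assumed, standard choice)."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def gcn_layer(A_hat, H, W, activate=True):
    """One graph convolution: mix neighbor features, then apply a linear map."""
    out = A_hat @ H @ W
    return np.maximum(out, 0.0) if activate else out

def ml_gcn_scores(image_feature, label_embeddings, A, W1, W2):
    """Return one score per label for a single image.

    image_feature    : (D,) stand-in for the CNN output f(x; theta)
    label_embeddings : (C, e) word embeddings used as node features
    A                : (C, C) label correlation (adjacency) matrix
    """
    A_hat = normalize_adjacency(A)
    H = gcn_layer(A_hat, label_embeddings, W1)
    W_labels = gcn_layer(A_hat, H, W2, activate=False)  # (C, D) label representations
    return W_labels @ image_feature                      # dot product per label
```

The final GCN layer must output vectors of the same dimension `D` as the image feature so that the dot products are defined.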

ML-GCN resorts to the traditional multi-label classification loss for training. The entire construction of ML-GCN is shown in Fig. 7(x).

MetaLink (cao2022relational) assumes that, for a given data point, at the inference time, the multi-task model has access to its labels from auxiliary tasks. Based on this assumption, MetaLink leverages labels from other tasks to improve the predictive performance. Particularly, MetaLink constructs a knowledge graph to capture not only the task-task relations as in ML-GCN but also the inter- and intra-relations between tasks and data.

The knowledge graph consists of two types of nodes: (1) data nodes, whose features are embeddings computed by the neural networks, and (2) task nodes, whose features are the last layer weights of the corresponding task-specific neural networks. Whenever a data sample belongs to a task, an edge is connected between these two nodes, and the label of the edge describes how the data point is classified in the particular task. In this way, MetaLink transfers the traditional MTL to a link prediction task between data nodes and task nodes, as shown in Fig. 7(y).

In terms of updating the entire model, MetaLink does not specify the criterion or introduce any particular regularizing terms.

Remarks (i) Graph-based representations allow tasks to be modeled as nodes and their relationships as edges. This enables the capturing of intricate dependencies and relationships among tasks, providing a more nuanced understanding compared to simplistic task relationship learning. (ii) GCN exhibits scalability in handling MTL scenarios with a large number of tasks. (iii) GCN excels in information propagation across tasks within a graph structure. The interconnected nature of tasks in a graph allows for the sharing of relevant information, fostering collaborative learning.

2.2.10. Neural Architecture Search (NAS)

NAS is a popular method for designing deep neural networks automatically, which has the potential to revolutionize the way neural networks are designed and used in many different fields, including MTL. NAS in MTL refers to the use of NAS to design neural networks that can perform multiple tasks simultaneously. This is different from traditional neural network design, where a separate network is typically trained for each task. In MTL, the goal is to learn a shared representation that can be used to perform multiple tasks effectively. Conventional architectures realize multi-tasking by hard-parameter sharing that trains multiple task heads on top of shared shallow feature extractors, e.g., TCDCN (zhang2014facial) and Fast RCNN (girshick2014rich; girshick2015fast), or by training a separate neural network for each task while sharing features across them, e.g., Cross-Stitch Networks (misra2016cross) and NDDR-CNN (gao2019nddr). However, the potential design space for deep multi-task neural architectures grows exponentially with the depth, and incorporating more tasks significantly expands the range of optimal solutions.

NAS can be used as an automatic approach to search for the optimal architecture for an MTL system. This involves defining a search space that includes a range of possible architectures and using a search algorithm to explore this space and identify the best-performing architecture. The search algorithm can be based on techniques such as reinforcement learning, evolutionary algorithms, or gradient-based optimization. There are several benefits to using NAS in multi-task learning. For example, it can reduce the need for manual design of the network architecture, improve the performance of the multi-task system, and reduce the amount of data and computation required to train the network. It can also be used to identify architectures that are more efficient and easier to implement in practice.

Fully-Adaptive Feature Sharing (FAFS) (lu2017fully) is the earliest method that trains networks with an adaptive widening process. The initial network is a slimmed-down version obtained by reducing the number of convolutional filters in a CNN or neurons in an MLP. It gradually expands through a multi-round widening and training procedure, facilitated by a top-down splitting algorithm. In practice, the original active layer, depicted as the $L$-th layer in Fig. 7(q), consists of numerous branches. These branches are then grouped together in the lower $(L-1)$-th layer. Subsequently, the $(L-1)$-th layer becomes the new active layer, and this iterative process continues from the top layers until convergence.

Branched Multi-Task Networks (BMTN) (DBLP:conf/bmvc/VandenhendeGGB20) argues that learning the layer-sharing level in early soft parameter sharing methods suffers from sub-optimal solutions, and that relying solely on NAS to design the MTL architecture is significantly cumbersome. By leveraging the affinities of the involved tasks using Representation Similarity Analysis (RSA) (dwivedi2019representation), BMTN can automatically cluster the tasks at shared locations, in which bottom layers are task-agnostic and top layers gradually grow to be task-specific. For each task, as depicted in Fig. 7(r), BMTN initially computes the representation dissimilarity matrices (RDMs) between $K$ images at $D$ locations. The RDMs are defined as $1-\rho$, where $\rho$ represents the Pearson correlation coefficient (pearson1895vii). Subsequently, the task affinity tensor $\mathcal{A}\in\mathbb{R}^{D\times T\times T}$ is established based on the RDMs of all tasks using Spearman's correlation coefficient (spearman1961proof). Finally, BMTN is established by minimizing the sum of the task dissimilarity scores (i.e., $1-\mathcal{A}_{d,i,j}$) between each pair of tasks $i$ and $j$ at every location $d$, where $i,j=1,\cdots,T$ and $d=1,\cdots,D$.
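The RDM and task-affinity computation can be sketched as below. The sketch assumes that features are given as one `(K, dim)` array per task and location, that only the upper triangle of each RDM is compared (a common RSA convention, not stated in the text), and that ties in the Spearman ranking can be ignored; the function names are hypothetical.

```python
import numpy as np

def rdm(features):
    """Representation dissimilarity matrix: 1 - Pearson correlation between K images."""
    return 1.0 - np.corrcoef(features)   # features: (K, dim) -> (K, K)

def _ranks(v):
    """Ordinal ranks of a 1-D array (ties are not averaged in this sketch)."""
    return np.argsort(np.argsort(v)).astype(float)

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(_ranks(a), _ranks(b))[0, 1])

def task_affinity(features_per_task):
    """features_per_task[t][d]: (K, dim) features of task t at location d.

    Returns the affinity tensor of shape (D, T, T).
    """
    T, D = len(features_per_task), len(features_per_task[0])
    K = features_per_task[0][0].shape[0]
    iu = np.triu_indices(K, k=1)          # off-diagonal upper triangle of each RDM
    A = np.ones((D, T, T))
    for d in range(D):
        vecs = [rdm(features_per_task[t][d])[iu] for t in range(T)]
        for i in range(T):
            for j in range(i + 1, T):
                A[d, i, j] = A[d, j, i] = spearman(vecs[i], vecs[j])
    return A
```

The dissimilarity scores $1-\mathcal{A}_{d,i,j}$ used for branching then follow as `1.0 - task_affinity(...)`.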

Multi-Task Learning by Neural Architecture Search (MTL-NAS) (gao2020mtl) is a method to search cross-task edges into fixed single-task network backbones. The framework is shown in Fig. 7(s). It involves a single-shot gradient-based search algorithm that can optimize the architecture weights over all legal connections defined by the search space. Specifically, this search algorithm contains continuous relaxation and discretization procedures. This novel search algorithm is able to close the performance gap between search and evaluation and also generalizes the popular single-shot gradient-based methods such as DARTS (liu2018darts) and SNAS (xie2018snas).

Remarks (i) NAS facilitates the automatic and adaptive discovery of task-specific neural network architectures, departing from conventional hard or soft parameter sharing stereotypes. (ii) NAS not only searches for architectures but can also optimize hyperparameters during the search process. This automatic tuning ensures that the MTL model is configured with optimal settings for each task, reducing the need for manual fine-tuning. (iii) NAS can discover architectures that capture task dependencies effectively, allowing tasks to share information efficiently. This adaptability is crucial in scenarios where tasks have a significant influence on each other.

Figure 18. The taxonomy of PFMs of MTL into three categories: (a) Downstream Task Fine-Tuning, (b) Task Prompting, and (c) Unified Generalist Model.

2.3. Foundation Model Era: Towards Unified and Versatile

AI models are shifting their focus from deeper networks (e.g., ConvNets (fukushima1980neocognitron; lecun1998gradient; he2016deep; liu2022convnet), GANs (goodfellow2020generative), CapsNets (sabour2017dynamic), RNNs (rumelhart1986learning; hochreiter1997long)) to foundation models (e.g., BERT (devlin2018bert), GPT-4 (https://openai.com/research/gpt-4) (openai2023gpt4), SAM (kirillov2023segment), DALL·E 3 (https://openai.com/dall-e-3) (ramesh2021zero)). Such foundation models leverage web-scale pretraining data in the wild (usually in self-supervised, unsupervised, and assisted-manual ways) and then adapt their backbones to different downstream tasks (bommasani2021opportunities; zhou2023comprehensive), and are thus inherently compatible with MTL. In light of the recent development of scalable learners, particularly Transformers, foundation models evolve from parameter-based transfer learning with new emergent capabilities. They facilitate the integration of multiple tasks into a pretrained backbone, achieved through only fine-tuning or even zero-shot learning (ZSL). In this context, the emergent properties in foundation models extend MTL from a fixed set of tasks (where training and test tasks are identical) to handling unknown tasks. When viewed from a task-oriented perspective, MTL, empowered by foundation models, can be categorized into three distinct types:

  1. (Downstream) Task-Generalizable Fine-tuning. This category involves the uni-modal learning of inclusive representations in semi-supervised, self-supervised, and unsupervised learning manners. Notable examples include BiGAN (donahue2016adversarial; donahue2019large), BERT (devlin2018bert), MoCo (he2020momentum; chen2020improved; chen2021empirical), SimCLR (chen2020simple; chen2020big), MAE (he2022masked), and GPT (radford2018improving; radford2019language; brown2020language; openai2023gpt4). The learned encoders should be transferable to a variety of downstream supervised tasks, thereby enabling them to be multi-task learners.

  2. Task-Promptable Engineering. In this category, the original inputs are modified through task-specific prompts (e.g., SAM (kirillov2023segment)) during the pretraining stage. Prompt engineering can affect the representation of data and equip the learners with few-shot and even zero-shot abilities toward new tasks.

  3. Task-Agnostic Unification. This category highlights that the representations remain unbiased toward specific tasks and data modalities by employing a unified serialization/sequence of data tokens, including Pix2Seq (chen2022pixseq; chen2022unified), UniTAB (yang2022unitab), Unified-IO (lu2022unified), Uni-Perceiver (nips_zhu2022uni; cvpr_zhu2022uni; li2023uni), OFA (wang2022ofa; bai2022ofasys), Gato (reed2022generalist), UnIVAL (shukor2023unified), etc. As a result, multi-modal learners can generalize from existing tasks to new ones, even those involving diverse data modalities.

2.3.1. Downstream Task Fine-Tuning

When Pretrained Foundation Models (PFMs) (zhou2023comprehensive) first emerged, the term “pre-training” remained somewhat ambiguous within DL research. The practice involves first learning model backbones on a general dataset, e.g., ImageNet (deng2009imagenet; russakovsky2015imagenet), and then transferring them to other tasks, where fine-tuning starts from this warm initialization. Consequently, “fine-tuning” before the PFM era refers to tuning the model backbones themselves. In our context, fine-tuning that changes backbone parameters is referred to as model tuning, unless otherwise specified. The distinction matters because PFMs are costly to backpropagate through, and the ability to generalize a large frozen backbone to multiple downstream tasks, referred to as downstream fine-tuning, can ease this burden. By confining our discussion to downstream fine-tuning with a frozen model, we can extend the previous definition of MTL (refer to Definition 3). In this context, a single model can effectively handle a set of tasks. This approach also facilitates a clear separation from the domain of (parameter-based) TL.

In the context of fine-tuning for downstream tasks facilitated by PFMs, the process typically begins with the pre-training of a backbone foundation model on large data in the wild. This pre-training phase often employs unsupervised or self-supervised methods. Subsequently, the pretrained backbone is fine-tuned using task-specific domain datasets, as illustrated in Fig. 17(a). Leveraging the task-unbiased representations acquired from the frozen backbone, fine-tuning of task-specific heads (e.g., simple MLPs for classification tasks or mask decoders for dense prediction tasks) frequently yields competitive or even superior results when compared to prior supervised outcomes across a spectrum of diverse downstream tasks.
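
As a minimal sketch of the frozen-backbone setting described above, the following toy example trains one linear head per task on top of a shared feature extractor that receives no updates. The "backbone" here is only a fixed random projection standing in for a pretrained encoder, and all names (`BACKBONE`, `train_head`, the two toy tasks) are hypothetical and chosen for illustration.

```python
import math
import random

random.seed(0)

IN_DIM, FEAT_DIM = 4, 8

# Hypothetical stand-in for a pretrained backbone: a fixed random projection
# followed by tanh. In practice this would be a large frozen encoder.
BACKBONE = [[random.gauss(0.0, 1.0) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]


def backbone(x):
    """Frozen feature extractor shared by every downstream task."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def head_loss(head, data):
    """Mean binary cross-entropy of a linear head applied to frozen features."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * h for w, h in zip(head, backbone(x))))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train_head(data, steps=200, lr=0.5):
    """Fit one task-specific linear head; only the head weights are updated."""
    head = [0.0] * FEAT_DIM
    for _ in range(steps):
        grad = [0.0] * FEAT_DIM
        for x, y in data:
            h = backbone(x)
            err = sigmoid(sum(w * hi for w, hi in zip(head, h))) - y
            for j in range(FEAT_DIM):
                grad[j] += err * h[j] / len(data)
        head = [w - lr * g for w, g in zip(head, grad)]
    return head


inputs = [[random.gauss(0.0, 1.0) for _ in range(IN_DIM)] for _ in range(64)]
# Two toy downstream tasks defined on the same inputs.
task_a = [(x, 1 if x[0] > 0 else 0) for x in inputs]
task_b = [(x, 1 if x[1] + x[2] > 0 else 0) for x in inputs]

backbone_before = [row[:] for row in BACKBONE]
heads = {"task_a": train_head(task_a), "task_b": train_head(task_b)}
```

Because each task only adds a small head, supporting a new task in this sketch costs `FEAT_DIM` extra parameters while the shared backbone is reused unchanged.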

Nonetheless, it is important to note that the pre-training phase tends to restrict data modality due to the constraints of self-supervised techniques, which are inherently data-specific. For instance, methodologies like masked image modeling (MIM) in MAE are suitable for image data, while masked language modeling (MLM) in BERT is tailored for text data. The following review provides an in-depth exploration of downstream task fine-tuning methods categorized by data modality. Specifically, we discuss these methods within the domains of vision, language, and vision-language tasks.

Vision Tasks. Early pre-training techniques in computer vision primarily focus on learning from pretext tasks. Exemplar CNN (dosovitskiy2014discriminative; alexey2016discriminative), for instance, initially pretrains backbone models by discriminating various patches within unlabeled data. In the case of Inpainting (pathak2016context), the pretext task involves predicting the masked central parts of images. Colorization (zhang2016colorful), on the other hand, establishes mappings from grayscale images to their colored versions. Split-Brain Autoencoders (zhang2017split) force the network to split into two disjoint sub-networks, each processing one-half of the input images while predicting the corresponding missing parts from the other sub-network. Recently, BEiT (bao2021beit; peng2022beit) and MAE (he2022masked) simply reconstruct randomly masked patches of the images to pretrain the backbones, i.e., masked image modeling (MIM). Other MIM methods include iBOT (zhou2021ibot), CAE (chen2023context), SimMIM (xie2022simmim), BEVT (wang2022bevt), ConMIM (yi2022masked), and VideoMAE (tong2022videomae; wang2023videomae), to name a few. Jigsaw (noroozi2016unsupervised) and Completing Damaged Jigsaw Puzzles (CDJP) (kim2018learning) employ jigsaw puzzles as pretext tasks during model pre-training. Counting (noroozi2017representation) can also serve as a pretext task to facilitate representation learning. Noise As Targets (NAT) (bojanowski2017unsupervised) focuses on learning representations by aligning the deep features of the backbone with predefined targets in a low-dimensional space. RotNet (gidaris2018unsupervised), in turn, is designed to predict different image rotations. Notably, such early pretext-task pre-training techniques typically do not require manual annotations, allowing for fast training without the necessity of developing new loss functions. Common downstream tasks include classification, object detection, and segmentation. 
Thus, parameter-efficient fine-tuning (PEFT) of MTL models becomes challenging since the model must adapt to the needs of multiple tasks simultaneously. MTLoRA (agiza2024mtlora) is the first to address this problem and outperforms other SOTA PEFT methods.

An alternative line of research aims to design a general representation learning algorithm that is unbiased to the pretext tasks, often referred to as contrastive self-supervised learning (SSL) (jaiswal2020survey; liu2021self). This method unlocks the potential of representations by introducing a novel loss function that hinges on the concept of “contrast.” If we denote the sets of samples that are similar and dissimilar to $\mathcal{X}$ as $\mathcal{X}^{+}$ and $\mathcal{X}^{-}$, respectively, the Noise Contrastive Estimation (NCE) loss (gutmann2010noise) can be defined as

$$\mathcal{L}_{\text{NCE}} = \mathbb{E}_{\mathcal{X},\mathcal{X}^{+},\mathcal{X}^{-}}\left[-\log\frac{e^{f(\mathcal{X})^{\top}f(\mathcal{X}^{+})}}{e^{f(\mathcal{X})^{\top}f(\mathcal{X}^{+})}+e^{f(\mathcal{X})^{\top}f(\mathcal{X}^{-})}}\right], \tag{118}$$

where the function $f(\cdot)$ represents the encoder function used to learn image embeddings. It is worth noting that the inner-product similarity used above (which coincides with cosine similarity when the embeddings are $\ell_2$-normalized) can be customized to suit various scenarios. Additionally, the InfoNCE loss (oord2018representation) extends this concept by incorporating a more extensive set of dissimilar pairs as

$$\mathcal{L}_{\text{InfoNCE}} = \mathbb{E}_{\mathcal{X},\mathcal{X}^{+},\mathcal{X}^{b}}\left[-\log\frac{e^{f(\mathcal{X})^{\top}f(\mathcal{X}^{+})}}{e^{f(\mathcal{X})^{\top}f(\mathcal{X}^{+})}+\sum_{b=1}^{B-1}e^{f(\mathcal{X})^{\top}f(\mathcal{X}^{b})}}\right], \tag{119}$$

where $B$ represents the batch size, comprising $B-1$ negative pairs $\{(\mathcal{X},\mathcal{X}^{b})\}_{b=1}^{B-1}$ and one positive pair $(\mathcal{X},\mathcal{X}^{+})$. These loss functions are closely linked to the maximization of mutual information (MI) between the encoded representations.
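
To make Eqs. (118) and (119) concrete, the sketch below evaluates both losses for a single anchor, assuming the embeddings $f(\cdot)$ have already been computed and are given as plain lists of floats. The function names are hypothetical, and the expectation in the equations corresponds to averaging this per-anchor value over a batch.

```python
import math


def dot(u, v):
    """Inner-product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(anchor, positive, negatives):
    """Per-anchor InfoNCE loss as in Eq. (119):
    -log( exp(a.p) / (exp(a.p) + sum_b exp(a.n_b)) )."""
    pos_logit = dot(anchor, positive)
    logits = [pos_logit] + [dot(anchor, n) for n in negatives]
    m = max(logits)  # subtract the max before exponentiating for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos_logit


def nce_loss(anchor, positive, negative):
    """Eq. (118) is the special case with exactly one negative sample."""
    return info_nce_loss(anchor, positive, [negative])


anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negative = [0.0, 1.0]
loss_one_negative = nce_loss(anchor, positive, negative)
loss_three_negatives = info_nce_loss(anchor, positive, [negative] * 3)
```

Adding negatives can only enlarge the denominator, so `loss_three_negatives` is at least as large as `loss_one_negative` for the same anchor and positive.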

Many contrastive SSL methods draw from the loss functions (118) and (119) to acquire task-invariant representations. Non-parametric instance discrimination (NPID) (wu2018unsupervised) can capture apparent similarity among instances using NCE. In contrast, contrastive predictive coding (CPC) (oord2018representation; henaff2020data) first introduces the InfoNCE loss for the pre-training of RNNs in an autoregressive manner. Deep InfoMax (DIM) (hjelm2018learning), Deep Graph InfoMax (DGI) (velivckovic2018deep), and Augmented Multiscale DIM (AMDIM) (bachman2019learning) take a direct approach by maximizing the MI between representations. Contrastive multiview coding (CMC) (tian2020contrastive) extends the concept of MI maximization to incorporate more than two views. MoCo (he2020momentum; chen2020improved; chen2021empirical) employs InfoNCE but introduces the momentum contrast based on a memory bank used in (wu2018unsupervised). SimCLR (chen2020simple; chen2020big) proposes a novel contrastive loss known as the normalized temperature-scaled cross-entropy loss (NT-Xent) for representation learning. Bootstrap Your Own Latent (BYOL) (grill2020bootstrap), conversely, takes a different approach by obviating the need for negative pairs. On the other hand, several other methods (caron2018deep; caron2020unsupervised; goyal2021self; li2020prototypical) endeavor to employ clustering algorithms that contrast data representations based on class prototypes.
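
The momentum mechanism mentioned for MoCo can be sketched as follows, under the assumption (following the description in he2020momentum) that the key encoder is an exponential moving average of the query encoder and that negatives are drawn from a fixed-size first-in-first-out queue of past keys. Parameters are flattened into lists of floats, and the names `momentum_update` and `KeyQueue` are hypothetical.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q.
    The key encoder is not updated by gradients, only by this moving average."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class KeyQueue:
    """Fixed-size FIFO store of encoded keys that serve as negatives."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)  # oldest keys are dropped automatically

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)


theta_q = [1.0, 2.0]
theta_k = [0.0, 0.0]
theta_k = momentum_update(theta_k, theta_q, m=0.9)

queue = KeyQueue(max_size=4)
queue.enqueue(["k1", "k2", "k3"])
queue.enqueue(["k4", "k5"])  # "k1" is evicted once the queue is full
```

A momentum value close to 1 keeps the keys in the queue approximately consistent with each other even though they were encoded at different training steps.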

Language Tasks. In the domain of language, initial pre-training approaches utilizing word embeddings (mikolov2013distributed; pennington2014glove) to predict subsequent tokens for a warm start have shown potential in enhancing the performance of downstream NLP tasks (dai2015semi; mccann2017learned). Nonetheless, these methods often rely on a limited dataset for pre-training, which restricts their effectiveness and prevents consistently satisfactory outcomes across the spectrum of downstream NLP tasks. Current Transformer-based Pre-trained Foundation Models (PFMs) in natural language processing can be broadly classified into three types (wang2022pre): encoder-only, decoder-only, and encoder-decoder architectures. Encoder-only architectures employ a bidirectional Transformer encoder designed to reconstruct masked tokens. Decoder-only models utilize a unidirectional Transformer decoder that predicts tokens in a left-to-right autoregressive fashion. Encoder-decoder models are crafted for sequence-to-sequence (seq2seq) generation tasks, pretrained by masking tokens in the source sequence and predicting them in the target sequence.
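
One way to see the difference among the three architecture types is through their attention masks. The sketch below builds these masks as nested lists, where entry `[i][j] == 1` means position `i` may attend to position `j`. It is an illustrative simplification (padding masks and masking strategies are omitted), and the function names are hypothetical.

```python
def bidirectional_mask(n):
    """Encoder-only: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-only: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def seq2seq_masks(src_len, tgt_len):
    """Encoder-decoder: bidirectional over the source, causal over the target,
    and full cross-attention from each target position to the source."""
    return {
        "encoder_self": bidirectional_mask(src_len),
        "decoder_self": causal_mask(tgt_len),
        "cross": [[1] * src_len for _ in range(tgt_len)],
    }


masks = seq2seq_masks(src_len=4, tgt_len=3)
```

The lower-triangular `decoder_self` mask is what restricts decoder-only models to left context, while the all-ones `encoder_self` mask gives encoder-only models access to the full context.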

This taxonomy aligns with the constraints in terms of tasks. Since encoder-only architectures, e.g., BERT (devlin2018bert), ERNIE 1.0/2.0 (sun2019ernie; sun2020ernie), SpanBERT (joshi2020spanbert), DeBERTa (he2020deberta), and GLaM (du2022glam), are pretrained to predict masked tokens based on bidirectional context, they are better suited for understanding tasks than for generation tasks. They are adept at tasks like document classification, named entity recognition, and question answering, where the full context is available and the task is to understand or extract information rather than generate it. Encoder-only models often have a fixed maximum sequence length, which limits their ability to handle very long documents directly. They are not designed for incremental token-by-token generation and thus are inefficient for tasks that require such predictions, like text completion or interactive text generation. Conversely, decoder-only architectures, e.g., GPT-3 (brown2020language), PanGu-$\alpha$ (zeng2021pangu), Turing-NLG, HyperCLOVA (kim2021changes), Gopher (rae2021scaling), LaMDA (thoppilan2022lamda), PaLM (chowdhery2022palm), Open Pre-trained Transformers (OPT) (zhang2022opt), LLaMA (touvron2023llama; touvron2023llama), PanGu-$\Sigma$ (ren2023pangu), and PaLM-2 (anil2023palm), are pre-trained in a unidirectional context, making them well-suited for generative tasks such as language modeling and text generation. However, this unidirectional training means they may be less effective for tasks that require understanding the full context of the input, as they can only condition on the left context. These models generate one token at a time, which can be slower than models that process the entire input at once, and they might struggle with tasks requiring bidirectional context. 
Encoder-decoder architectures, e.g., T5 (raffel2020exploring), BART (lewis-etal-2020-bart), ERNIE 3.0 (sun2021ernie), Switch Transformers (fedus2022switch), and Flan-T5 (chung2022scaling), are more flexible as they can handle both understanding and generation tasks. While they offer considerable advantages in terms of their adaptability to various tasks, they come with trade-offs in terms of model complexity, resource requirements, and potential issues with error propagation.

Vision-Language Tasks. PFMs effectively manage multiple tasks without requiring model tuning. However, the aforementioned methods remain constrained to a unimodal context. In real-world scenarios, there is a natural requirement for multimodal or cross-modal intelligence. Such intelligence should handle multiple tasks across diverse modalities and domains. Vision-Language (VL), as its name implies, bridges CV and NLP. It was among the first areas to be extensively explored by the research community for multi-modal learning in recent years. Given the intricacy and scope of VL tasks, foundation models employing vision-language pre-training (VLP) have rapidly gained prominence, showcasing notable performance. Initial VLP approaches (su2019vl; li2019visualbert; tan2019lxmert; chen2020uniter; kim2021vilt; li2021align) centered on specific tasks such as visual question answering (VQA), image captioning, visual grounding, etc.

The advent of contrastive language-image pre-training (CLIP) (radford2021learning), however, marks a significant leap forward in multiple downstream tasks: it jointly trains dual encoders to align (image, text) pairs within a latent embedding space, learning SOTA multimodal representations from unstructured image-text data. The general representations obtained by cross-modal contrastive learning deliver strong performance in zero-shot transfer across various vision-language (VL) tasks. In a similar trajectory, A Large-scale ImaGe and Noisy-text embedding (ALIGN) (jia2021scaling) leverages uncurated data, amplifying the efficacy of VLP in downstream cross-modal retrieval tasks. Other contrastive VLP methods include ALBEF (li2021align), WenLan (huo2021wenlan), triple contrastive learning (TCL) (yang2022vision), and BLIP (li2022blip; li2023blip). All these methods contribute to the learning of general-purpose visual and linguistic representations, seamlessly adapting to a variety of downstream tasks ranging from cross-modal reasoning (e.g., VQA) and cross-modal matching (e.g., Image Text Retrieval and Visual Referring Expression) to vision and language generation tasks. Notably, DALL·E (ramesh2021zero) stands out for its remarkable capability to perform text-to-image generation tasks in a zero-shot manner, meeting commercial application standards. This underscores the potential and versatility of VLP in facilitating generalist applications.
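
The dual-encoder alignment used by CLIP-style methods can be sketched as a symmetric cross-entropy over the image-text similarity matrix of a batch, where the $k$-th image and the $k$-th text form the positive pair. The example below assumes the embeddings are already computed and uses a fixed temperature of 0.07 for illustration (in CLIP the temperature is a learned parameter). Function names are hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one row of logits and an integer target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k])."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[2.0, 0.0], [0.0, 3.0]]
shuffled_texts = [[0.0, 3.0], [2.0, 0.0]]
aligned_loss = clip_style_loss(images, aligned_texts)
shuffled_loss = clip_style_loss(images, shuffled_texts)
```

The loss is small when matching pairs are the most similar entries in their row and column, and large when the pairing is shuffled, which is the signal that drives the two encoders toward a shared embedding space.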

Remarks (i) Downstream fine-tuning reduces the data requirements for downstream tasks and also the training (fine-tuning) time and resources. (ii) Downstream fine-tuning eases the intensive training burden and enhances the accessibility of PFMs, rendering them a practical solution available to anyone. (iii) Downstream fine-tuning necessitates that the data modalities for downstream tasks remain consistent with those pretrained in pretext tasks. (iv) Due to PFMs containing pretext task biases, the full potential of multi-task performance remains unrealized.

2.3.2. Task Prompting

As PFMs evolve, the incorporation of prompting into the tuning process of frozen PFMs for downstream tasks first became widely recognized under the name “prompt design” (brown2020language) and was subsequently carried forward through the practice of “prompt tuning” (lester2021power). Conceptually, prompts serve as carriers of task-descriptive information, enabling the adaptation of PFMs to various tasks; they can be either manually crafted or automatically generated, as illustrated in Fig. 17(b). The primary benefit of prompts is that they significantly alleviate the demands of task-specific fine-tuning: the backbone parameters of PFMs are frozen and only task-indicating prompts are learned, which enhances few-shot or even zero-shot generalizability while requiring only augmented inputs and minimal to no parameter updates. A comprehensive examination of prompt taxonomy exceeds the scope of this section. Consequently, we adopt the notion of task prompting to encompass all prompt engineering methodologies within the framework of task adaptation and generalization.

The additional task-specific prompts used to augment the model can be either hard or soft (gu2023systematic). Hard prompts contain task instructions or hints in human-interpretable natural language, including human instructions (radford2019language; efrat2020turking) in the early stage and the more advanced In-Context Learning (ICL) (dong2022survey) and chain-of-thought (CoT) (yu2023towards; chu2023survey). Soft prompts, also referred to as continuous prompting or prompt tuning, are optimized implicitly in the embedding space and can be learned/propagated to align with specific tasks.

Hard Prompt Engineering. Large Language Models (LLMs) can perform different tasks by making predictions based on a few examples in the context, i.e., ICL. This ability to learn from demonstration and analogy is also presented as an emergent ability (wei2022emergent) of LLMs. GPT-3 (brown2020language) first verified that LLMs are few-shot learners and that different tasks can be performed given a few examples in the form of demonstration context. InstructGPT (ouyang2022training) further aligned LLMs with user intent using reinforcement learning from human feedback (RLHF). The developments in ICL include strategies in both the training stage (wei2021finetuned; chen-etal-2022-improving; min-etal-2022-metaicl; wang2022super; iyer2022opt; wei2023symbol; gu2023pre) and the inference stage (liu2021makes; rubin-etal-2022-learning; gonen2022demystifying; sorensen-etal-2022-information; zhang2022active; li2023finding; lu-etal-2022-fantastically; honovich2022instruction; zhou2022least; hao2022structured; xu2023small; xu2023k). FLAN (wei2021finetuned) tuned LLMs via natural language instruction templates over 60 NLP tasks and surpassed zero-shot GPT-3 on some of the datasets. MetaICL (min-etal-2022-metaicl) introduced meta-training for ICL on a broader spectrum (on the order of 100) of NLP tasks. Sup-NatInst (wang2022super) presented a benchmark with on the order of 1000 NLP tasks and proposed T$k$-Instruct, which can outperform InstructGPT with fewer parameters. OPT-IML (iyer2022opt) scales LLM instruction meta-learning to 2000 NLP tasks through the lens of generalization. Symbol Tuning (wei2023symbol) targets the situation where instructions or natural language labels are not informative for predicting the task. PICL (gu2023pre) enhanced the ICL ability of LLMs through pre-training while maintaining task generalization, whereas earlier investigations studied how to select in-context examples for better few-shot capabilities during the testing stage (liu2021makes). 
Other methods (gonen2022demystifying; sorensen-etal-2022-information; zhang2022active; li2023finding; lu-etal-2022-fantastically; honovich2022instruction; zhou2022least; hao2022structured; xu2023small; xu2023k) tried to understand why performance varies across prompts and how to pick better prompts from different angles. After the prompt retriever (rubin-etal-2022-learning) was verified to be efficient for ICL, many efforts used a prompt pool to support retrieval-based prompting, where relevant prompts or contexts are retrieved for ICL (rubin2021learning; li2023unified; ye2023compositional; zhang2023makes).
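
The in-context learning setup discussed above can be illustrated with a small prompt-construction sketch. The template (`Input:`/`Output:` fields) and the word-overlap retriever are hypothetical simplifications: published retrieval-based methods typically score candidates with learned or embedding-based similarity rather than token overlap.

```python
def select_demonstrations(pool, query, k=2):
    """Pick the k pool examples whose inputs share the most words with the query.
    Word overlap is only a stand-in for a learned prompt retriever."""
    query_words = set(query.lower().split())

    def overlap(example):
        return len(query_words & set(example[0].lower().split()))

    return sorted(pool, key=overlap, reverse=True)[:k]


def build_icl_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, input-output demonstrations, and the query."""
    lines = [instruction, ""]
    for example_input, example_output in demonstrations:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


pool = [
    ("the movie was great", "positive"),
    ("terrible plot and acting", "negative"),
    ("the food was great", "positive"),
]
query = "the acting was great"
demos = select_demonstrations(pool, query, k=2)
prompt = build_icl_prompt("Classify the sentiment of each input.", demos, query)
```

The resulting string is passed to a frozen LLM as-is; switching to another task only requires a different instruction and demonstration pool, with no parameter updates.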

Furthermore, chain-of-thought (CoT) prompts are a series of instructions in progressive order, which can help LLMs perform complex reasoning tasks step by step (wei2022chain; kojima2022large; zhang2022automatic; fu2022complexity; ho2022large; trivedi2022interleaving; chen2022program). Manual-CoT (wei2022chain) first explores how to improve the reasoning ability of LLMs by generating CoT. Zero-Shot-CoT (kojima2022large) proposes a single task-agnostic zero-shot prompt that can surpass ICL even without input-output demonstrations. Complex-CoT (fu2022complexity) shows that complex reasoning chains outperform simple chains. Auto-CoT (zhang2022automatic) mitigates the mistakes that could arise in previous manual approaches by automatically constructing demonstrations for different questions. Fine-tune-CoT (ho2022large) uses teacher-generated reasoning to fine-tune smaller models. IRCoT (trivedi2022interleaving) interleaves retrieval with reasoning steps and, in turn, improves CoT using the retrieved results. PoT (chen2022program) uses programming language statements to delegate math computations.
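
As a brief illustration of the two prompting styles above, the sketch below formats a zero-shot CoT prompt using the trigger phrase "Let's think step by step." reported by kojima2022large, and a few-shot CoT prompt whose demonstrations include a rationale before the answer. The `Q:`/`A:` layout and helper names are assumptions made for illustration.

```python
COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question):
    """Task-agnostic prompt: no demonstrations, only a reasoning trigger."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def few_shot_cot_prompt(demonstrations, question):
    """Each demonstration is (question, rationale, answer); the rationale
    shows the intermediate reasoning the model is expected to imitate."""
    blocks = [
        f"Q: {q}\nA: {rationale} The answer is {answer}."
        for q, rationale, answer in demonstrations
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


demo = (
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5.",
    "5",
)
zero_shot = zero_shot_cot_prompt("A train travels 60 km in 1.5 hours. What is its speed?")
few_shot = few_shot_cot_prompt([demo], "A train travels 60 km in 1.5 hours. What is its speed?")
```

Both prompts leave the model parameters untouched; they differ only in whether worked reasoning examples are supplied in the context.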

Soft Prompt Tuning. In comparison, soft prompt tuning optimizes prompt vectors via backpropagation and gradient descent. lester2021power introduces the concept of “prompt tuning” and distinguishes it from previous model tuning and prompt design methods. During training, prompt tuning can refine the prompts to improve learning performance on specific tasks. Thus, the multi-task setting can be realized by simply mixing training data across different tasks. Soft Prompt Transfer (SPoT) (vu2021spot) pioneers the demonstration that prompt tuning can efficiently transfer from source to target tasks, offering a parameter-efficient approach to prompt-based transfer learning across diverse tasks. P-Tuning (liu2022p) empirically optimizes prompt tuning to be universally effective across a wide range of tasks. ATTEntional Mixtures of Prompt Tuning (ATTEMPT) (asai2022attempt) exemplifies this concept by combining multiple prompts trained on large-scale source tasks, generalizing instance-wise prompts on target tasks while keeping model parameters and source prompts frozen. Multi-task Pre-trained Modular Prompt (MP$^2$) (sun-etal-2023-multitask) enhances FSL for prompt tuning in multi-task settings. 10.1145/3583780.3614913 is the first to showcase that prompt learning achieves SOTA performance for MTL in FSL settings, even surpassing ChatGPT. Hierarchical Prompt (HiPro) learning (liu2023hierarchical) evaluates prompt tuning on standard MTL datasets and outperforms SOTA MTL methodologies by learning task-shared and task-individual prompts. Multitask Vision-Language Prompt Tuning (MVLPT) (shen2024multitask) incorporates cross-task knowledge into learning a single transferable prompt for vision-language models (VLMs). Prompt Guided Transformer (PGT) (lu2024prompt) introduces a prompt-conditioned Transformer block, integrating task-specific prompts into the self-attention mechanism, achieving global dependency modeling and parameter-efficient feature adaptation across multiple tasks. 
The PromptonomyViT (PViT) model (herzig2024promptonomyvit) leverages prompts to capture task-specific information in video Transformers.
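
The mechanics of soft prompt tuning can be sketched with a deliberately tiny frozen model: learnable prompt vectors are prepended to frozen token embeddings, and gradients update only the prompt. Everything below (`EMBED`, `HEAD`, the toy tasks) is hypothetical. Because this toy model mean-pools and applies a linear head, the prompt can only shift the output logit, so the sketch shows the training mechanics rather than the expressive power of prompts in a real Transformer.

```python
import math

# Frozen "pretrained" parameters (hypothetical toy values, never updated).
EMBED = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
HEAD = [1.0, -1.0]
DIM = 2


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(prompt, token_ids):
    """Prepend soft prompt vectors to token embeddings, mean-pool, apply the frozen head."""
    sequence = prompt + [EMBED[t] for t in token_ids]
    pooled = [sum(column) / len(sequence) for column in zip(*sequence)]
    logit = sum(w * h for w, h in zip(HEAD, pooled))
    return logit, len(sequence)


def loss(prompt, data):
    """Mean binary cross-entropy for a given prompt."""
    total = 0.0
    for token_ids, y in data:
        p = sigmoid(forward(prompt, token_ids)[0])
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def tune_prompt(data, prompt_len=1, steps=200, lr=1.0):
    """Gradient descent on the prompt vectors only; EMBED and HEAD stay frozen."""
    prompt = [[0.0] * DIM for _ in range(prompt_len)]
    for _ in range(steps):
        grads = [[0.0] * DIM for _ in range(prompt_len)]
        for token_ids, y in data:
            logit, n = forward(prompt, token_ids)
            err = sigmoid(logit) - y
            for g in grads:
                for d in range(DIM):
                    # d(logit)/d(prompt[j][d]) = HEAD[d] / n under mean pooling
                    g[d] += err * HEAD[d] / n / len(data)
        prompt = [[p - lr * g for p, g in zip(pv, gv)] for pv, gv in zip(prompt, grads)]
    return prompt


sequences = [[0, 0, 2], [0, 1, 2], [1, 1, 2], [0, 2, 2]]
task_a = list(zip(sequences, [1, 1, 0, 1]))
task_b = list(zip(sequences, [1, 0, 0, 0]))
prompts = {"task_a": tune_prompt(task_a), "task_b": tune_prompt(task_b)}
```

Each task ends up with its own small prompt while sharing the same frozen parameters, which is the multi-task arrangement the paragraph above describes.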

Prefix-tuning (li2021prefix) is another lightweight alternative to fine-tuning LLMs for different tasks that also keeps model parameters frozen. Prefix-tuning learns a continuous task-specific vector prefixed to the subsequent tokens. It can obtain comparable performance in the full data setting and outperform fine-tuning in low-data settings. chen2022unisumm proposes a Unified few-shot Summarization (UniSumm) model pretrained on multiple text summarization tasks, which can generalize to different few-shot tasks through prefix-tuning. chong2023leveraging trains a prefix transfer module to selectively leverage the knowledge from various prefixes according to the input text. Collaborative domain-Prefix tuning for cross-domain NER (CP-NER) (chen2023one) utilizes text-to-text generation, grounding domain-related instructions to transfer knowledge to new domain tasks. Prefix-tuning approaches highlight the importance of leveraging prefixes and domain-specific information for improving performance in multiple tasks.

Remarks (i) Task prompting stands out as highly parameter-efficient, demanding fewer than $0.01\%$ of task-specific parameters even for models exceeding a billion parameters (lester2021power). (ii) Task-specific prompts exhibit a remarkable degree of adaptability, affording the capacity for on-the-fly customization to accommodate a diverse set of tasks, thus enhancing the flexibility in managing a multitude of heterogeneous tasks simultaneously. (iii) Task prompting facilitates the achievement of few-shot and even zero-shot learning, empowering PFMs to effectively perform tasks with minimal to no examples. (iv) Researchers/practitioners can have fine-grained control over how the model performs different tasks, as prompts can be customized to guide the model behavior precisely. (v) The prompt itself is not transferable across different PFMs, thus leading to the limitations of scalability and reusability of prompt designs. (vi) Human involvement in prompting, e.g., crafting prompts or selecting appropriate templates, is time-consuming and bias-inducing.
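
The order of magnitude in remark (i) can be checked with simple arithmetic. The values below (a prompt of 100 tokens, an embedding dimension of 4096, and an 11-billion-parameter backbone) are illustrative assumptions rather than figures taken from this survey.

```python
def prompt_parameter_fraction(prompt_length, embed_dim, backbone_params):
    """Fraction of task-specific parameters when only a soft prompt is learned."""
    return prompt_length * embed_dim / backbone_params


# Assumed illustrative configuration: 100 prompt tokens, 4096-dim embeddings,
# and a frozen backbone with 11 billion parameters.
prompt_params = 100 * 4096
fraction = prompt_parameter_fraction(100, 4096, 11_000_000_000)
print(f"{prompt_params} prompt parameters = {100 * fraction:.4f}% of the backbone")
# prints: 409600 prompt parameters = 0.0037% of the backbone
```

Under these assumptions each additional task costs about 0.4 million parameters, which is below the $0.01\%$ threshold quoted in remark (i).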

2.3.3. Unified Generalist Models

The ambitious aspiration, shared by both research communities and industries, has always been to transition from specialization to unification, thereby constructing an ideal generalist model capable of addressing a diverse set of tasks with varying modalities. The advent of large language models (LLMs)

The blueprint for designing general-purpose multimodal foundation models aligns with recent unified models such as Gato (reed2022generalist), Unified-IO (lu2022unified), OFA (wang2022ofa), and Uni-Perceiver (zhu2022uni; li2023uni). These methods can perform a variety of tasks spanning from CV to NLP, without modality limitations. Please see Fig. 19 for an illustration.

Figure 19. The framework of the unified generalist model, which can unify architectures, tasks, and modalities through a simple seq2seq learning architecture.

To pretrain a Transformer backbone for general MTL usage, we need to tokenize the multi-modal input data. For images, the common practice follows ViT (dosovitskiy2020image) in sequencing non-overlapping $16\times 16$ patches in raster order, with the size of $256/16$ for each patch. Typically, the bounding boxes of objects in region-based tasks are represented by the quantization scheme of Pix2Seq (chen2022pixseq). In text preprocessing, the OFA framework adopts the same BPE tokenizer (sennrich2015neural) used in BART (lewis-etal-2020-bart), and its tokens keep the order of the raw input text. Based on this preprocessing, it is possible to build a unified vocabulary for all visual, linguistic, and multi-modal tokens. After that, suppose we are given a sequence of tokens $\mathbf{x}_{i,b}$ as input, where $i=1,\cdots,I$ indexes the tokens in a data sample and $b=1,\cdots,B$ indexes a sample in a training batch. The architecture for a unified model is parametrized by $\theta$. Then we are able to autoregressively train the model via the chain rule as follows:

$$\mathcal{L}_{\theta}(\mathbf{x}_{1,1},\cdots,\mathbf{x}_{i,b}) = \sum_{b=1}^{B}\log\prod_{i=1}^{I}p_{\theta}(\mathbf{x}_{i,b}\mid\mathbf{x}_{1,b},\cdots,\mathbf{x}_{i-1,b}) = \sum_{b=1}^{B}\sum_{i=1}^{I}\log p_{\theta}(\mathbf{x}_{i,b}\mid\mathbf{x}_{<i,b}) \tag{120}$$
Remarks (i) The unified generalist model allows for modality-agnostic and task-agnostic learning, overcoming the limitations inherent to specific tasks. This implies that any task can be modeled into an omnivorous model. (ii) The unified generalist model achieves parameter efficiency and saves storage space when many tasks are served. (iii) The unified generalist model is pre-trained using multimodal data all at once but possesses enduring utility.
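
The autoregressive objective in Eq. (120) can be sketched with a toy conditional model over a small unified vocabulary. The bigram table below is a hypothetical stand-in for $p_{\theta}$ (a real unified model would condition on the full prefix with a Transformer), and the token names are invented for illustration.

```python
import math

# Hypothetical conditional distributions over a unified vocabulary that mixes
# a visual token ("<img_7>") with text tokens. The key is the previous token.
BIGRAM = {
    None: {"<img_7>": 0.5, "a": 0.3, "cat": 0.2},
    "<img_7>": {"<img_7>": 0.1, "a": 0.6, "cat": 0.3},
    "a": {"<img_7>": 0.1, "a": 0.1, "cat": 0.8},
    "cat": {"<img_7>": 0.2, "a": 0.4, "cat": 0.4},
}


def p_theta(token, prefix):
    """Toy p_theta(x_i | x_<i): depends only on the last token of the prefix."""
    previous = prefix[-1] if prefix else None
    return BIGRAM[previous][token]


def batch_log_likelihood(batch):
    """Sum over samples b and positions i of log p_theta(x_{i,b} | x_{<i,b})."""
    total = 0.0
    for sample in batch:
        for i, token in enumerate(sample):
            total += math.log(p_theta(token, sample[:i]))
    return total


batch = [["<img_7>", "a", "cat"], ["a", "cat", "cat"]]
log_likelihood = batch_log_likelihood(batch)
negative_log_likelihood = -log_likelihood  # the quantity minimized in training
```

The first sample has probability 0.5 × 0.6 × 0.8 = 0.24 and the second 0.3 × 0.8 × 0.4 = 0.096, so the sum of per-token log-probabilities equals the log of the product of these sequence probabilities, as the chain rule in Eq. (120) states.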

The concept of a unified architecture for multi-modal MTL can be traced back to OmniNet (pramanik2019omninet). Drawing on the potential of Transformers, pramanik2019omninet propose a single model that supports tasks with multiple input modalities as well as asynchronous MTL. lu202012 investigates the relationships between vision-language (VL) tasks and proposes a single model targeting 12 datasets simultaneously. li2021towards introduces the concept of unified foundation models by jointly pre-training Transformers on unpaired images and text data. The Unified Transformer (UniT) model (hu2021unit) is a realization of this concept. It features separate encoders for different input modalities and a shared decoder over the encoded input representations. Each task is associated with specific heads in the shared decoder. wang2022ofa; bai2022ofasys propose One-for-All (OFA) as a task-agnostic and modality-agnostic framework. OFA aims to unify task-specific layers for downstream tasks, providing a versatile solution. However, it is important to note that OFA currently lacks support for video data and necessitates fine-tuning for downstream tasks. Uni-Perceiver (zhu2022uni) is a unified architecture for generic perception for zero-shot and few-shot tasks, which includes a video tokenizer with temporal positional embeddings. Uni-Perceiver v2 (li2023uni) further introduces task-balanced gradient normalization to ensure stable MTL, which enables larger batch-size training for various tasks. More importantly, unlike OFA (wang2022ofa), Uni-Perceiver v2 requires no task-specific adaptation. Mask DETR with Improved deNoising anchOr boxes (Mask DINO) (li2023mask) is a unified framework designed for object detection and segmentation. Mask DINO uses an additional mask prediction branch to unify the query selection for masks. 
All-in-one Transformer (wang2023all) unifies video and text encoders by introducing a token rolling operation to encode temporal representations from videos. Omnivorous Masked Auto-Encoder (OmniMAE) (girdhar2023omnimae) shows that MAE can be used to pretrain a ViT on images and videos without any human labels. OmniVec (srivastava2024omnivec) also pretrains a unified architecture on self-supervised masked data, including visual, audio, text, and 3D, which realizes cross-modal task generalization.

3. Miscellaneous

3.1. Fairness and Bias in MTL

While most existing research on bias and fairness focuses on STL (mehrabi2021survey), wang2021understanding pioneer the exploration of the fairness-accuracy trade-off within the MTL setting. Unaligned fairness goals arise in MTL models that optimize accuracy for all tasks. Novel multi-task fairness metrics, such as the average relative fairness gap and the average relative error, help quantify this trade-off in MTL applications. li2023fairness emphasize that misspecification of majority and minority groups in the involved tasks disproportionately affects minority tasks, and they propose over-parameterization as a viable way to achieve fairness by covering all tasks. hu2023fairness extend the definition of Strong Demographic Parity (agarwal2019fair; jiang2020wasserstein) to MTL using multi-marginal Wasserstein barycenters (chzhen2020fair), providing an optimal fair multi-task solution to the fairness-accuracy trade-off. roy2022learning further demonstrate that improving fairness can also improve accuracy. Learning to Teach Fair Multi-Tasking (L2T-FMT) (roy2022learning) introduces a teacher-student network to address fair MTL problems: the teacher guides the student in selecting fairness or accuracy objectives during training, offering a dynamic approach to balancing these objectives. roy2023fairbranch liken the negative impact of task-specific fairness to negative transfer and introduce FairBranch, a method that groups related tasks to mitigate this negative transfer through fairness loss gradient conflict correction. In recent years, prioritizing fair MTL to mitigate biases arising from negative transfer has emerged as a promising direction. This approach can ensure that models treat all tasks fairly, avoiding disproportionate impacts on specific groups or tasks. 
By preventing biased outcomes, fair MTL contributes to averting potential societal harm.

3.2. Security and Privacy in MTL

Attack and Defense. MTL is an impactful technique employed to bolster attacks in diverse sectors. It notably expedites the creation of adversarial examples for numerous tasks simultaneously through the exploitation of task-shared knowledge (guo2020multi). In the field of automatic speaker verification, multi-task learning strategies have been utilized to identify replay attack spoofing and to classify different types of replay noise (shim2018replay). With regard to reinforcement learning, the vulnerability of multi-task federated reinforcement learning algorithms to adversarial attacks has been examined, resulting in the development of an adaptable attack method and a refined federated reinforcement learning algorithm (anwar2021multi). Additionally, within the realm of deep reinforcement learning, a multi-objective strategy for developing attack policies has been suggested, considering both the performance degradation and the cost related to the attack (garcia2020learning). Conversely, MTL can also serve as a means to heighten the model’s resilience, leading to an improved defense against a wide array of malicious attacks. For instance, the robustness of models to adversarial attacks on individual tasks has been shown to increase when models are trained on multiple tasks concurrently (mao2020multitask; guo2020multi). Likewise, multi-task learning has been employed for adversarial defense (naseer2022stylized), using supplementary data from the feature space to design more formidable adversaries and boost the model’s resilience. Through the utilization of multi-task objectives, such as cross-entropy loss, feature-scattering, and margin losses, more powerful perturbations can be devised for adversarial training. This technique has been used in several domains, such as computer vision and speech recognition, and has demonstrated enhanced adversarial accuracy and resilience (pal2021adversarial; chan2021multiple).

Privacy-preserving. Privacy-preserving multi-task learning (PP-MTL) (liu2018privacy) aims to ensure the confidentiality of sensitive data and boost learning outcomes by facilitating knowledge transfer across related tasks. PP-MTL algorithms employ cryptographic mechanisms to safeguard data residing across various locations or nodes, using these to relay cumulative data - for instance, gradients or supports - to a centralized server where the aggregated data is processed to create the desired models. Existing strategies cannot deliver a demonstrable or verifiable security assurance for the transferred cumulative data. To tackle this shortcoming, various innovative PP-MTL protocols have been suggested, leveraging cutting-edge cryptographic methods to deliver the strongest possible security assurance (liu2018privacy). Furthermore, differential private stochastic gradient descent algorithms have been employed to optimize the comprehensive multi-task model and safeguard the privacy of training data by introducing appropriately calibrated noise to the gradient of loss functions (zhang2020privacy). To maintain the privacy of distributed data, privacy-preserving distributed MTL frameworks have been introduced, incorporating a privacy-preserving proximal gradient algorithm. This algorithm updates models asynchronously and offers guaranteed differential privacy (xie2017privacy).
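The differentially private SGD idea mentioned above (clip each example's gradient, then add calibrated noise before the update) can be sketched as follows. This is a minimal illustration for a linear model with squared loss, not the algorithm of zhang2020privacy; the function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and no privacy accounting is performed.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One noisy, clipped gradient step for linear regression (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    residual = X @ w - y                          # shape (n,)
    per_example_grads = residual[:, None] * X     # shape (n, d), one gradient per example
    # Clip every per-example gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Add Gaussian noise calibrated to the clipping bound, then average.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / X.shape[0]
    return w - lr * noisy_mean_grad
```

In a multi-task setting, `w` would typically hold the shared parameters, and the per-example gradient would be taken of the summed task losses; that extension is left out here to keep the sketch short.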

Federated Learning. Federated Multi-task Learning (FMTL) (smith2017federated) represents a platform for training machine learning models over distributed device networks. By personalizing models for individual clients, it successfully navigates the statistical complexities posed by federated learning, given the heterogeneity of local data distributions (smith2017federated). It effectively manages high communication overhead, lags, and reliability in distributed multi-task learning (marfoq2021federated). The efficacy of FMTL has been demonstrated on real-world federated datasets, even with non-convex models (sarcheshmehpour2021networked). It can be utilized in both a central server-client and a fully decentralized structure and provides the capacity to serve personalized models to clients unseen during training (corinzia2019variational). Furthermore, the over-the-air computation can be integrated within FMTL to enhance system efficiency, reducing channel usage without a substantial drop in learning performance (ma2022over).

3.3. Distribution Shifts in MTL

While Multi-Task Learning (MTL) excels at leveraging shared information to boost individual task performance (1.3), its real-world applicability often hinges on its ability to adapt to unforeseen data distributions. Distribution shifts, where the data encountered during deployment deviates from the training distribution, are omnipresent challenges that can significantly degrade MTL performance, especially on new tasks or domains. Recognizing and mitigating these shifts is crucial not just for maintaining the generalizability and resilience of MTL models but also for unlocking their full potential in real-world applications.

Recent research offers a diverse arsenal of approaches to tackle distribution shifts in MTL. Vision Transformer Adapters (ViTA) (bhattacharjee2023vision) introduce dedicated modules within the model architecture that enhance adaptability to diverse tasks and data distributions. Techniques like regularizing spurious correlations (hu2022improving) target misleading associations between tasks, reducing their influence on the overall model performance. Scalarization methods provide a scalable framework for handling the complexities of multi-task and multi-domain learning while facing distribution shifts (royer2023scalarization). Multi-objective learning strategies, exemplified by approaches addressing catastrophic forgetting in time-series applications (10.1145/3502728), strive to mitigate the issue of forgetting previously learned skills when encountering new data. Finally, techniques like reward modeling (faal2023reward) demonstrate their versatility in addressing distribution shifts, as seen in mitigating toxicity issues in transformer-based language models. This array of advancements underscores the ongoing efforts to equip MTL models with enhanced adaptability and resilience to varying task distributions, ultimately paving the way for their reliable and widespread real-world application.

Looking ahead, the evolving landscape of MTL research envisions models that not only react to distribution shifts but proactively anticipate and address them. As highlighted in a recent comprehensive study (adhikarla2023robust), understanding and mitigating distribution shifts are becoming paramount for MTL’s success. The ability to navigate diverse and dynamic data distributions is crucial for the broader deployment of MTL in complex, real-world scenarios. By advancing techniques that enhance adaptability and robustness, researchers are striving to empower MTL models to excel in the face of evolving task and domain landscapes, unlocking their potential to revolutionize a wide array of applications.

3.4. Non-supervised MTL

Semi-supervised learning. Supervised learning has been a fundamental technique in machine learning. However, it requires a substantial amount of labeled data to yield promising results, and labeling is both time-consuming and costly. Semi-supervised learning mitigates this by leveraging unlabeled data to reduce the dependence on labeled data. Existing semi-supervised algorithms are often not directly amenable to MTL. To address this, (liu2007semi) introduces a semi-supervised MTL framework featuring $M$ parameterized classifiers. Each classifier is associated with a partially labeled data manifold, and the classifiers are jointly learned under a soft-sharing prior on their parameters. This approach exploits unlabeled data by basing the learning of classifiers on neighborhood structures. Besides, (augenstein2018multi) models the relationship between labels by inducing a joint label embedding space for MTL and proposes a Transfer Network, which learns to transfer labels between tasks and uses semi-supervised learning to leverage them for training. Multi-task regression is also prevalent in real-world applications. (zhang2009semi) proposes the SMTR method, which is grounded in Gaussian Processes (GP) and assumes that the kernel parameters for all tasks share a common prior. To incorporate unlabeled data, the GP prior’s kernel function is modified into a data-dependent one, yielding a semi-supervised extension of SMTR named SSMTR. 
Additionally, (chen2020multi) introduces a multi-task mean teacher model for semi-supervised shadow detection, effectively utilizing unlabeled data and simultaneously learning multiple aspects of shadows. Specifically, they construct a multi-task baseline model designed to detect shadow regions, edges, and count, leveraging the complementary information of these elements. This baseline model is then implemented in both student and teacher networks. The approach further involves aligning the predictions from the three tasks across these networks, using this alignment to compute a consistency loss on unlabeled data. This loss is combined with the supervised loss from labeled data based on the predictions of the multi-task baseline model, thereby enhancing the model’s learning effectiveness. (nguyen2019multi) proposed a network employing a multi-task learning approach to detect manipulated images and videos and to identify the manipulated regions within each query. To enhance the network’s generalizability, a semi-supervised learning approach is integrated in which the architecture comprises an encoder and a Y-shaped decoder. The activation of encoded features facilitates binary classification. Meanwhile, the outputs of the decoder’s branches serve distinct purposes: one for segmenting the manipulated regions and the other for reconstructing the input. This dual functionality significantly contributes to the improvement of the overall performance of the network. Semi-supervised multitask learning (MTL) has emerged as a popular field, with various preceding studies, as mentioned above, that propose different mechanisms that integrate semi-supervised concepts. These studies have demonstrated their effectiveness through numerous experimental results. Despite these advancements, there remains a substantial scope for further research in this subfield. Continued exploration in semi-supervised MTL promises to yield many more valuable insights and findings.
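The mean-teacher mechanism described above (a teacher that tracks the student, plus a consistency loss over several task outputs on unlabeled data) can be sketched in a few lines. This is a simplified illustration rather than the implementation of chen2020multi; the function names, the use of mean squared error for consistency, and the `decay` and `weight` parameters are assumptions made for the example.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter toward the student (exponential moving average)."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def consistency_loss(student_preds, teacher_preds):
    """Mean squared disagreement between student and teacher, averaged over tasks."""
    per_task = [np.mean((student_preds[t] - teacher_preds[t]) ** 2)
                for t in student_preds]
    return float(np.mean(per_task))

def total_loss(supervised_loss, student_preds, teacher_preds, weight=1.0):
    """Supervised loss on labeled data plus weighted consistency on unlabeled data."""
    return supervised_loss + weight * consistency_loss(student_preds, teacher_preds)
```

Here `student_preds` and `teacher_preds` would be dictionaries keyed by task (for instance, hypothetical keys such as "region", "edge", and "count" for the three shadow-related tasks), each holding the corresponding network output on the same unlabeled batch.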

Unsupervised learning. Beyond semi-supervised learning, real-world scenarios often make it infeasible to obtain labeled data for all tasks in MTL, underscoring the significance of unsupervised learning for MTL. OpenAI (radford2019language) introduced the GPT-2 model, demonstrating a significant advance for MTL in natural language processing. Their research showed that language models begin to learn a variety of tasks, including question answering, machine translation, reading comprehension, and summarization, without explicit supervision. This capability was observed when the model was trained on WebText, a new dataset comprising millions of webpages. This development showcases the potential of large language models to adapt to a wide array of tasks through extensive unsupervised learning. Besides, existing clustering approaches neglect the underlying relationships among clustering tasks and treat them either individually or simply together. To alleviate this limitation, (5360241) introduces multi-task clustering, which conducts several related clustering tasks concurrently and leverages the relationships between these tasks to improve clustering performance. 
This approach comprises two key components: (1) within-task clustering, which clusters the data of each task individually within its own input space, and (2) cross-task clustering, where a shared subspace is learned simultaneously and the data from all tasks are clustered together. This dual-faceted strategy combines individual task insights with cross-task synergies. Another notable example concerns point cloud tasks, where (hassani2019unsupervised) introduces an unsupervised multi-task model designed to concurrently learn point and shape features. It incorporates three unsupervised tasks, namely clustering, reconstruction, and self-supervised classification, which are used to train a multi-scale graph-based encoder. Finally, (argyriou2006multi) introduces a method for learning a low-dimensional representation shared across multiple related tasks. This method extends the well-known 1-norm regularization problem by incorporating a novel regularizer that controls the number of features common to all tasks. The authors show that this approach can be formulated as a convex optimization problem and develop an iterative algorithm to solve it. The algorithm alternates between a supervised step and an unsupervised step: in the unsupervised step, it learns representations common across tasks, while in the supervised step, it uses these common representations to learn task-specific functions. This approach effectively combines supervised and unsupervised learning techniques to enhance multi-task learning.
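The alternating scheme described above can be illustrated with a small sketch. It assumes a squared loss and the commonly used formulation in which a positive semidefinite matrix D with unit trace couples the tasks through the penalty $\gamma \sum_t w_t^\top D^{-1} w_t$; the function name, the smoothing constant `eps`, and the fixed iteration count are choices made for this example and are not taken from argyriou2006multi.

```python
import numpy as np

def alternating_feature_learning(Xs, ys, gamma=1.0, eps=1e-6, n_iters=50):
    """Alternate between task-wise regression (supervised) and updating D (unsupervised)."""
    d = Xs[0].shape[1]
    D = np.eye(d) / d                      # feasible start: PSD with unit trace
    for _ in range(n_iters):
        D_inv = np.linalg.inv(D)
        # Supervised step: with D fixed, each task is a generalized ridge regression.
        W = np.column_stack([
            np.linalg.solve(X.T @ X + gamma * D_inv, X.T @ y)
            for X, y in zip(Xs, ys)
        ])
        # Unsupervised step: D = (W W^T + eps I)^{1/2}, rescaled to unit trace.
        evals, evecs = np.linalg.eigh(W @ W.T + eps * np.eye(d))
        sqrt_m = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
        D = sqrt_m / np.trace(sqrt_m)
    return W, D
```

Each column of `W` holds one task's weight vector. When only a few input directions matter to all tasks, `D` tends to concentrate its trace on those directions, which is the sense in which the representation is shared.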

3.5. Others

3.5.1. Applications of MTL

In the DL era, the advancement of multimodal analysis and MTL paradigms has brought challenges and also opened up exciting possibilities for MTL. In addition to the applications investigated in this paper, MTL plays an important role in many different fields such as visual assessment (yu2019towards; zhang2023blind), healthcare (zhang2023knowledge; zhao2023multi; zhang2023biomedgpt), transportation (wang2023multi; feng2023forecast), language models (liu2020multi; hu2021unit), and recommender systems (zhang2023advances; deng2023unified). Briefly, zhang2023blind develop a general and automated multitask learning scheme for blind image quality assessment. zeng2023new combine MTL algorithms with a deep belief network for the diagnosis of Alzheimer’s disease. wang2023multi propose a multi-task weakly supervised learning framework to infer transition probabilities between road segments. gao2023enhanced utilize relation-aware GCNs to fully capture multi-relation neighborhood features.

Despite the achievements in recent years, many outstanding MTL approaches still suffer from limitations that restrict their application to certain real-world scenarios. For example, it is difficult to capture the complex inter-scenario correlations with multiple tasks. Besides, in large-scale tasks, it remains a challenge to design scalable models and deal with the parameter explosion issue. Therefore, the scalability of MTL models is still a direction worth exploring (zhang2023advances).

3.5.2. MTL+X

MTL + Continual Learning. Biased forgetting of previous knowledge caused by new tasks remains challenging in continual learning. lyu2021multi propose Multi-Domain Multi-Task (MDMT) rehearsal to train the old tasks and new tasks together while keeping tasks from isolation. he2019task utilize meta-learning to achieve task-agnostic continual learning. MTL is a promising technique to mitigate catastrophic forgetting via learning task-relatedness.

Multi-Task Reinforcement Learning (MTRL). MTRL (vithayathil2020survey) holds promise in the context of Reinforcement Learning (RL), given the natural presence of diverse tasks like reach, push and pick in robotic manipulation. In the early stage, wilson2007multi approaches it as the solution to a sequence of Markov Decision Processes (MDPs) and employs a hierarchical Bayesian framework to infer the characteristics of new environments based on knowledge gained from previous environments. hessel2019multi introduce a method to automatically adjust the contribution of each task to the updates of a single agent. This ensures that all tasks exert a similar impact on the learning dynamics. taiga2022investigating investigates multi-task pretraining and generalization in RL. cheng2023multi propose an attention-based multi-task reinforcement learning approach to learn a compositional policy for each task.
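The idea of equalizing each task's influence on a single agent can be illustrated with per-task adaptive target normalization in the style of PopArt, on which hessel2019multi build. The sketch below is a simplified, assumption-laden version: a linear value head, scalar targets, and a hypothetical class name `PopArtHead`. It tracks running statistics of each task's targets and rescales the head so that its unnormalized predictions are unchanged when the statistics move.

```python
import numpy as np

class PopArtHead:
    """Linear value head with per-task target normalization (illustrative sketch)."""

    def __init__(self, n_features, n_tasks, beta=0.01):
        self.W = np.zeros((n_tasks, n_features))
        self.b = np.zeros(n_tasks)
        self.mu = np.zeros(n_tasks)      # running mean of targets per task
        self.nu = np.ones(n_tasks)       # running second moment per task
        self.beta = beta

    def sigma(self):
        return np.sqrt(np.maximum(self.nu - self.mu ** 2, 1e-8))

    def predict(self, features, task):
        normalized = self.W[task] @ features + self.b[task]
        return self.sigma()[task] * normalized + self.mu[task]

    def update_stats(self, target, task):
        old_mu, old_sigma = self.mu[task], self.sigma()[task]
        self.mu[task] += self.beta * (target - self.mu[task])
        self.nu[task] += self.beta * (target ** 2 - self.nu[task])
        new_sigma = self.sigma()[task]
        # Rescale weights and bias so unnormalized outputs are preserved.
        self.W[task] *= old_sigma / new_sigma
        self.b[task] = (old_sigma * self.b[task] + old_mu - self.mu[task]) / new_sigma
```

Because the loss is then computed on normalized targets, tasks with very different reward scales produce gradients of comparable magnitude, which is one way to make all tasks exert a similar impact on the learning dynamics.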

4. Resources

In this section, we offer useful tools and resources that can help researchers and practitioners implement MTL models.

Table 8. Summary of common datasets used in MTL.
Dataset | Source | Year | Modality | Task | Synopsis | #Task | #Sample | Availability
School Data | ILEA | mortimore1988school | Table | Regression | Predicting student exam scores based on 27 school features. | 139 | 15,362 | Official
SARCOS Data | Humanoid Robotics | 2000 | Table | Regression | Estimate inverse dynamics model. | 7 | 44,484/4,449 | Official
Computer Survey Data | Survey | lenk1996hierarchical | Table | Regression | Likelihood of purchasing personal computers. | 179 | - | -
Climate Dataset | Sensor network | 2017-now | Table | Regression | Real-time climate data collected from four climate stations. | 7 | - | Official
20 Newsgroups | Netnews articles | Lang95 | Text | Classification | Hierarchical text classification. | 20 | 19,000 | Official
Reuters-21578 Collection | Reuters | 1996 | Text | Classification | Reuters news documents with hierarchical categories. | 90 | 21,578 | Official
MultiMNIST Dataset | MNIST | sabour2017dynamic | Image | Classification | Classify the digits on the different positions. | 2 | - | Official
ImageCLEF-2014 | Caltech, ImageNet, Pascal, Bing | 2014 | Image | Classification | Benchmark dataset for domain adaptation. | 4 | 2,400 | Official
Office-Caltech Dataset | Office, Caltech | gong2012geodesic | Image | Classification | Benchmark dataset for the annotation and retrieval of images. | 4 | 2,533 | Official
Office-31 Dataset | Amazon, DSLR, Webcam | saenko2010adapting | Image | Classification | Objects commonly encountered in office settings. | 3 | 4,110 | Official
Office-Home Dataset | Office | venkateswara2017deep | Image | Classification | Object recognition and domain adaptation in the era of deep learning. | 4 | 15,588 | Official
DomainNet Dataset | UDA | peng2019moment | Image | Classification | Multi-source unsupervised domain adaptation research. | 6 | 600,000 | Official
EMMa Dataset | Amazon | standley2023extensible | Image, Text | Classification | Amazon product listings for category prediction. | - | 2,800,000 | Official
SYNTHIA Dataset | European Union | ros2016synthia | Image | Classification | A synthetic dataset for semantic segmentation. | - | 13,400 | Official
SVHN Dataset | Stanford | yang2021few | Image | Classification | A digit classification benchmark dataset. | - | 600,000 | Official
CelebA Dataset | MMLAB | liu2018large | Image | Classification | A large-scale face attributes dataset. | 40 | 200,000 | Official
CityScapes Dataset | Daimler AG | cordts2016cityscapes | Image | Dense prediction | Semantic urban scene understanding. | - | 5,000 | Official
NYU-Depth Dataset V2 | New York University | silberman2012indoor | Image | Dense prediction | Indoor scene understanding with per-pixel labels. | 3 | 35,064 | Official
PASCAL VOC Project | University of Oxford | everingham2010pascal | Image | Dense prediction | Object recognition with multiple tasks. | - | - | Official
Taskonomy Dataset | Stanford | zamir2018taskonomy | Image | Dense prediction | Diverse dataset with 26 tasks for task transfer learning. | 26 | 4,000,000 | Official
STREET | Amazon | ribeiro2023street | Text | Reasoning | The multi-task structured reasoning and explanation benchmark. | - | - | -
VKITTI2 Dataset | Naver | cabon2020virtual | Video | Segmentation | A video dataset automatically labeled with ground truth. | 5 | - | Official
XTREME | Carnegie Mellon | hu2020xtreme | Text | Translation, QA | A multilingual benchmark for evaluating cross-lingual generalisation. | 9 | 400,000 | -
Deepfashion Dataset | Shopping Websites | liu2016deepfashion | Image | Classification | A large-scale clothes dataset with comprehensive annotations. | 2 | 800,000 | Official
ACE05 Dataset | News | 2005 | Text | Classification | A large corpus with annotated entities, relations and events. | 3 | 52,615 | Official
ATIS Dataset | Airline | hemphill-etal-1990-atis | Text | Classification | A dataset with 17 unique intent categories. | 3 | 5,871 | Official

4.1. Dataset

In this section, we introduce benchmark datasets for MTL from a taxonomic perspective. Specifically, according to the kind of data-driven models each dataset has given rise to, we group MTL datasets into three categories: regression tasks, classification tasks, and dense prediction tasks.

4.1.1. Regression task

Synthetic Data. Synthetic datasets are typically defined by the researchers themselves and therefore differ from one study to another, e.g., caruana1997multitask; bakker2003task; evgeniou2004regularized; argyriou2008convex; jalali2010dirty; zhou2011clustered; titsias2011spike; zhang2012convex; maurer2013sparse; han2016multi; parra2017spectral; nie2018calibrated; ma2018modeling, to name a few. The features are often generated by drawing random variables from a shared distribution and adding irrelevant variables from other distributions, and the corresponding responses are produced by a specific computational method. In this manner, the data of different tasks contain both task-specific and task-shared features that contribute to learning.
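A minimal sketch of such a generator is given below. It is one possible construction among the many used in the cited works, not a reproduction of any of them; the function name, the Gaussian and uniform distributions, and all default parameter values are assumptions chosen for illustration.

```python
import numpy as np

def make_synthetic_mtl(n_tasks=4, n_samples=50, n_shared=5, n_noise=3,
                       task_scale=0.1, noise_std=0.1, seed=0):
    """Generate regression tasks whose weights share a common component."""
    rng = np.random.default_rng(seed)
    shared_w = rng.normal(size=n_shared)          # component common to all tasks
    tasks = []
    for _ in range(n_tasks):
        X_rel = rng.normal(size=(n_samples, n_shared))           # relevant features
        X_irr = rng.uniform(-1.0, 1.0, size=(n_samples, n_noise))  # irrelevant features
        w_t = shared_w + task_scale * rng.normal(size=n_shared)  # task-specific deviation
        y = X_rel @ w_t + noise_std * rng.normal(size=n_samples)
        tasks.append((np.hstack([X_rel, X_irr]), y, w_t))
    return tasks
```

Smaller values of `task_scale` make the tasks more closely related, which is a convenient knob for studying how much a method benefits from sharing.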

School Data. mortimore1988school comes from the Inner London Education Authority (ILEA) and contains 15,362 records of student examinations, which are described by 27 student- and school-specific features from 139 secondary schools. The goal is to predict exam scores from the 27 features, and prediction for the 139 schools is generally handled as 139 tasks.

SARCOS Data. This humanoid robotics dataset (2000; gaussianprocess.org/gpml/data) consists of 44,484 training examples and 4,449 test examples. The goal is to estimate the inverse dynamics model of a 7 degrees-of-freedom (DOF) SARCOS anthropomorphic robot arm. Each degree of freedom corresponds to a task, and each example contains 21 features: 7 joint positions, 7 joint velocities, and 7 joint accelerations.

Computer Survey Data. lenk1996hierarchical is from a survey on the likelihood (11-point scale from 0 to 10) of purchasing personal computers. There are 20 computer models as examples, each of which contains 13 computer descriptions (e.g., price, CPU speed, and screen size) and 6 subject-level covariates (e.g., gender, computer knowledge, and work experience) as features, with the ratings of 179 subjects as targets, i.e., tasks.

Climate Dataset. This real-time dataset (2017-now; www.cambermet.co.uk) is collected from a sensor network (e.g., anemometer, thermistor, and pressure transducer) of four climate stations (Cambermet, Chimet, Sotonmet, and Bramblemet) in the south of England, which can represent 4 tasks as needed. The archived data are reported at 5-minute intervals and include about 10 climate signals (e.g., wind speed, wave period, barometric pressure, and water temperature). Generally, air temperature is treated as the dependent variable and the others as independent variables (parra2017spectral; zhao2019multiple).

4.1.2. Classification task

20 Newsgroups. Lang95 is a collection of approximately 19,000 netnews articles, organized into 20 hierarchical newsgroups according to topic, with root categories (e.g., comp, rec, sci, and talk) and sub-categories (e.g., comp.graphics, sci.electronics, and talk.politics.guns). Users can design different combinations as multiple text classification tasks (he2011graphbased; tan2015transitive; zhang2018multi; mao2020adaptive; xiao2020efficient).

Reuters-21578 Collection. This text collection (1996; www.daviddlewis.com/resources/testcollections/reuters21578/) contains 21,578 documents from the Reuters newswire dating back to 1987. These documents were assembled and indexed with more than 90 correlated categories under 5 top categories (i.e., exchanges, orgs, people, place, topic), each of which includes a variable number of sub-categories. Users can define related tasks independently by choosing different combinations of categories; zheng2020multi; xiao2021new provide more detailed descriptions.

CelebA Dataset. CelebFaces Attributes Dataset (CelebA) (liu2018large) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images cover large pose variations and background clutter. CelebA offers large diversity, large quantity, and rich annotations, including 10,177 identities, 202,599 face images, and 5 landmark locations and 40 binary attribute annotations per image. The dataset can be used for training and testing in the following computer vision tasks: face attribute recognition, face recognition, face detection, landmark (or facial part) localization, and face editing & synthesis.

MultiMNIST Dataset. This dataset originated from validating a capsule system (sabour2017dynamic), but it also serves as an MTL version of the MNIST dataset (lecun1998gradient). By overlaying multiple images together, traditional digit classification is converted to an MTL problem, where classifying the digit in each position is considered a distinct task. sener2018multi contribute a standard construction for the research community.

ImageCLEF-2014 Dataset. This dataset (2014; www.imageclef.org/2014/adaptation) is a benchmark for the domain adaptation challenge and contains 2,400 images of 12 common categories selected from 4 domains: Caltech 256, ImageNet 2012, Pascal VOC 2012, and Bing. These 4 domains are commonly considered as different tasks in MTL.
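The MultiMNIST construction described above, in which two digit images are overlaid and each position defines its own classification task, can be sketched as follows. The canvas size, the top-left and bottom-right placement, and the pixel-wise maximum used to merge overlapping regions are assumptions for this illustration and may differ from the standard construction of sener2018multi.

```python
import numpy as np

def make_multimnist_sample(digit_a, digit_b, label_a, label_b, canvas=36):
    """Overlay two digit images on one canvas; each position is a separate task."""
    h, w = digit_a.shape
    image = np.zeros((canvas, canvas), dtype=digit_a.dtype)
    image[:h, :w] = digit_a                                   # task 1: top-left digit
    # Task 2: bottom-right digit, merged with a pixel-wise maximum where they overlap.
    image[-h:, -w:] = np.maximum(image[-h:, -w:], digit_b)
    return image, (label_a, label_b)
```

A two-headed classifier would then be trained on `image`, with one head predicting `label_a` and the other predicting `label_b`.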

Office-Caltech Dataset. gong2012geodesic is a standard benchmark for domain adaptation in computer vision, consisting of real-world images of 10 common categories from the Office dataset and the Caltech-256 dataset. There are 2,533 images from 4 distinct domains/tasks: Amazon, DSLR, Webcam, and Caltech.

Office-31 Dataset.  saenko2010adapting consists of 4,110 images from 31 object categories across 3 domains/tasks: Amazon, DSLR, and Webcam.

Office-Home Dataset. venkateswara2017deep is collected for object recognition to validate domain adaptation models in the era of DL. It includes 15,588 images in office and home settings (e.g., alarm clock, chair, eraser, keyboard, telephone, etc.) organized into 4 domains/tasks: Art (paintings, sketches and artistic depictions), Clipart (clipart images), Product (product images from www.amazon.com), and Real-World (real-world objects captured with a regular camera).

DomainNet Dataset. peng2019moment is annotated for the purpose of multi-source unsupervised domain adaptation (UDA) research. It contains about 0.6 million images from 345 categories across 6 distinct domains, e.g., sketch, infograph, quickdraw, real, etc.

SYNTHIA Dataset.  ros2016synthia is a synthetic dataset created to address the need for a large and diverse collection of images with pixel-level annotations for vision-based semantic segmentation in urban scenarios, particularly for autonomous driving applications. It consists of precise pixel-level semantic annotations for 13 classes, including sky, building, road, sidewalk, fence, vegetation, lane-marking, pole, car, traffic signs, pedestrians, cyclists, and miscellaneous objects.

SVHN Dataset. Street View House Numbers (SVHN) (yang2021few) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits (from 0 to 9) cropped from pictures of house number plates. The cropped images are centered on the digit of interest, but nearby digits and other distractors are kept in the image. SVHN has three sets: a training set, a testing set, and an extra set with 530,000 less difficult images that can be used to support training.

Deepfashion Dataset. DeepFashion (liu2016deepfashion) is a large-scale clothes dataset with comprehensive annotations. It contains over 800,000 images, which are richly annotated with massive attributes, clothing landmarks, and correspondence of images taken under different scenarios including store, street snapshot, and consumer.

ACE05 Dataset. The ACE 2005 Multilingual Training Corpus (2005; catalog.ldc.upenn.edu/LDC2006T06) comprises the comprehensive collection of training data in English, Arabic, and Chinese for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus includes diverse data types that have been annotated for entities, relations, and events. The Linguistic Data Consortium (LDC), with support from the ACE Program and additional assistance from LDC, carried out the annotation of this dataset.

ATIS Dataset. The ATIS (Airline Travel Information Systems) dataset  (hemphill-etal-1990-atis) comprises audio recordings along with corresponding manual transcripts of human interactions with automated airline travel inquiry systems. These interactions involve individuals seeking flight-related information. The dataset includes 17 distinct intent categories representing different user intents. In the original data split, the training set contains 4,478 intent-labeled reference utterances, the development set contains 500 utterances, and the test set contains 893 utterances.

4.1.3. Dense prediction task

CityScapes Dataset. cordts2016cityscapes consists of 5,000 images with high-quality annotations and 20,000 images with coarse annotations from 50 different cities, covering 19 classes for semantic urban scene understanding. Specifically, pixel-wise semantic segmentation, instance segmentation, and ground-truth inverse depth labels are often used as three different tasks (kendall2018multi; liu2019end) in MTL.

NYU-Depth Dataset V2. silberman2012indoor is comprised of 1,449 images from 464 indoor scenes across 3 cities, containing 35,064 distinct objects of 894 different classes. The dense per-pixel labels of class, instance, and depth are used in many computer vision tasks, e.g., semantic segmentation, depth prediction, and surface normal estimation (eigen2015predicting).

PASCAL VOC Project. This project (everingham2010pascal; host.robots.ox.ac.uk/pascal/VOC) provides standardized image datasets for object class recognition and ran challenges evaluating performance on object class recognition from 2005 to 2012, of which VOC07 (host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html), VOC08 (host.robots.ox.ac.uk/pascal/VOC/voc2008/index.html), and VOC12 (host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) are commonly used for MTL research. The multiple tasks cover classification, detection (e.g., body part, saliency, semantic edge), segmentation, attribute prediction (farhadi2009describing), surface normals prediction (maninis2019attentive), etc. Many of the annotations were labeled or distilled by follow-up works (chen2014detect; maninis2019attentive).

Taskonomy Dataset. zamir2018taskonomy is currently the most diverse dataset for computer vision in MTL, consisting of 4 million samples from 3D scans of $\sim 600$ buildings. It provides a dictionary of 26 tasks (e.g., 2D, 2.5D, 3D, semantics, etc.) as a computational taxonomic map for task transfer learning. Accordingly, Tiny-Taskonomy (standley2020tasks), with 5 sampled dense prediction tasks (semantic segmentation, surface normal prediction, depth prediction, keypoint detection, and edge detection), is considered a commonly used benchmark in MTL.

4.1.4. Others

EMMa Dataset. EMMa Dataset (standley2023extensible) comprises more than 2.8 million objects from Amazon product listings, each annotated with images, listing text, mass, price, product ratings, and its position in Amazon’s product-category taxonomy. It includes a comprehensive taxonomy of 182 physical materials, and objects are annotated with one or more materials from this taxonomy. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, allowing for the addition of new tasks and object attributes at scale.

STREET. STREET (ribeiro2023street) is a multi-task benchmark for structured reasoning and explanations in NLP. It consists of five existing datasets (ARC, SCONE, GSM8K, AQUA-RAT, and AR-LSAT) and introduces a unified reasoning formulation with textual logical units and reasoning graphs. Evaluation metrics and empirical performance analysis using T5-large and GPT-3 models are provided, along with error explanations on a per-dataset basis.

VKITTI2 Dataset. Virtual KITTI (gaidon2016virtual) is a new video dataset, automatically labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth, and optical flow. Virtual KITTI 2 (cabon2020virtual) is a more photo-realistic and better-featured version of the original virtual KITTI dataset. It exploits recent improvements of the Unity game engine and provides new data such as stereo images or scene flow.

XTREME. The XTREME (Cross-lingual Transfer Evaluation of Multilingual Encoders) (hu2020xtreme) benchmark is a multi-task evaluation framework to assess the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. It highlights the performance disparity between models tested on English, which achieve human-level performance on numerous tasks, and cross-lingually transferred models, which exhibit a significant performance gap, particularly in syntactic and sentence retrieval tasks.

Table 9. Summary of libraries for MTL.
Library | Language | Supported Methods
RMTL | R | Sparse structure learning (tibshirani1996regression), multi-task feature selection (obozinski2006multi), low rank MTL (ji2009accelerated; pong2010trace), graph-based regularised MTL (widmer2010leveraging), multi-task clustering (gu2009learning)
MALSAR | Matlab | Sparse structure learning (tibshirani1996regression), regularized MTL (evgeniou2004regularized), multi-task feature selection (obozinski2006multi), dirty block-sparse model (jalali2010dirty), low rank MTL (ji2009accelerated; pong2010trace), convex ASO (chen2009convex), sparse & low rank MTL (chen2012learning), clustered MTL (zhou2011clustered), robust MTL (chen2011integrating), robust multi-task feature learning (gong2012robust), Temporal group Lasso (zhou2011multi), convex fused sparse group Lasso (zhou2012modeling), incomplete multi-source feature learning (yuan2012multi), multi-stage multi-task feature learning (gong2012multi), multi-task clustering (gu2009learning)
LibMTL | Python | Cross-stitch (misra2016cross), GradNorm (chen2018gradnorm), Uncertainty Weighting (kendall2018multi), MGDA-MTL (sener2018multi), MMoE (ma2018modeling), MultiNet++ (chennupati2019multinet++), LTB (guo2020learning), MTAN & DWA (liu2019end), PCGrad (yu2020gradient), GradDrop (chen2020just), CGC & PLE (tang2020progressive), IMTL (liu2021towards), GradVac (wang2021gradient), CAGrad (liu2021conflictaverse), DSelect-k (hazimeh2021dselect), RLW & RGW (lin2022reasonable), Nash-MTL (navon2022multi)

4.2. Software Resources

To provide playgrounds for researchers to fairly compare different state-of-the-art algorithms in a unified environment, open-source platforms for MTL have emerged. Herein we introduce three popular software resources that target different audiences in terms of implementation language, algorithm coverage, downstream task domains, and degree of modularization.

Regularized Multi-Task Learning (RMTL) (cran.r-project.org/web/packages/RMTL/index.html). It is a relatively small yet practical R library for MTL, especially suited to biology-related tasks. It includes ten algorithms applicable to regression, classification, joint predictor selection, task clustering, low-rank learning, and the incorporation of biological networks.

Multi-tAsk Learning via StructurAl Regularization (MALSAR) (github.com/jiayuzhou/MALSAR). It is an MTL package implemented in Matlab. Compared to RMTL, it does not focus on a particular field, yet it includes more algorithms. MALSAR implements 14 models with 26 of their variations to test their effectiveness.

Library for Multi-Task Learning (LibMTL) (github.com/median-research-group/LibMTL). It is a comprehensive open-source Python library built on PyTorch for MTL. LibMTL offers 104 MTL models obtained by combining 8 architectures with 13 loss weighting strategies. Moreover, it guarantees unified and consistent evaluations among models on three computer vision datasets. Different from the above packages, LibMTL is well-modularized and supports customization of individual components such as loss weighting strategies or architectures.

4.3. Evaluation Metric

4.3.1. Single-task Metric

In this section, we will introduce some single-task metrics that can be used to evaluate the performance of individual tasks in a multi-task learning (MTL) setup.

Regression Task Metric

Root Mean Squared Error (RMSE): RMSE is a commonly used metric to measure the average prediction error in regression tasks. It calculates the square root of the average of squared differences between predicted and true values. RMSE gives higher weights to larger errors, making it sensitive to outliers. It is calculated as:

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\tilde{y}_{i}-y_{i})^{2}}$$

where $y_{i}$ represents the true value, $\tilde{y}_{i}$ denotes the predicted value, and $n$ stands for the total number of samples.

Mean Absolute Percentage Error (MAPE): MAPE is a metric used to evaluate the accuracy of predictions in percentage terms. It measures the average percentage difference between predicted and true values. This metric is commonly used in business forecasting tasks. It is calculated as:

$$MAPE=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{\tilde{y}_{i}-y_{i}}{y_{i}}\right|\times 100$$

Symmetric Mean Absolute Percentage Error (SMAPE): SMAPE is similar to MAPE but has the advantage of being symmetric, meaning it treats overestimations and underestimations equally. It calculates the average percentage difference between predicted and true values, considering the absolute sum of both. It is calculated as:

$$SMAPE=\frac{100}{n}\sum_{i=1}^{n}\frac{\left|\tilde{y}_{i}-y_{i}\right|}{(\left|\tilde{y}_{i}\right|+\left|y_{i}\right|)/2}$$

Coefficient of Determination $R^{2}$ (R-squared): $R^{2}$ is a statistical metric that represents the proportion of variance in the dependent variable (the target) that is explained by the model's predictions. It indicates how well the predicted values fit the actual data. It is calculated as:

$$R^{2}=1-\frac{\sum_{i=1}^{n}(\tilde{y}_{i}-y_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}$$

where $\bar{y}$ is the mean of the true values $y_{i}$.
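The four regression metrics above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names and the toy values are our own, and the sketch assumes that no true value is zero (MAPE) and that the true values are not all identical ($R^{2}$).

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no true value is zero."""
    n = len(y_true)
    return 100.0 / n * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred))

def smape(y_true, y_pred):
    """Symmetric MAPE; assumes |prediction| + |target| is never zero."""
    n = len(y_true)
    return 100.0 / n * sum(
        abs(p - t) / ((abs(p) + abs(t)) / 2) for t, p in zip(y_true, y_pred)
    )

def r2(y_true, y_pred):
    """Coefficient of determination; assumes the targets are not all equal."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data for illustration only.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.0, 5.0, 6.0, 7.0]
print(rmse(y_true, y_pred), mape(y_true, y_pred),
      smape(y_true, y_pred), r2(y_true, y_pred))
```

In an MTL setting, such functions would typically be evaluated once per regression task on that task's own predictions.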

Classification Task Metric

Confusion Matrix: A confusion matrix is a table that allows visualization of the performance of a classification model. It presents the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions. The confusion matrix is usually represented as follows:

$$\begin{array}{c|cc} & \text{Predicted Positive} & \text{Predicted Negative}\\ \hline \text{Actual Positive} & \text{TP} & \text{FN}\\ \text{Actual Negative} & \text{FP} & \text{TN}\end{array} \tag{121}$$

Accuracy: Accuracy is one of the most straightforward classification metrics, representing the proportion of correctly classified instances over the total number of instances in the dataset. It is calculated as:

$$Accuracy=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$$

Precision: Precision is a metric that measures the proportion of true positive predictions (correctly predicted positive instances) over the total number of positive predictions made by the model. It is calculated as:

$$Precision=\frac{\text{TP}}{\text{TP}+\text{FP}}$$

Recall (Sensitivity or True Positive Rate - TPR): Recall calculates the proportion of true positive predictions (correctly predicted positive instances) over the total number of actual positive instances in the dataset. It is calculated as:

$$Recall=\frac{\text{TP}}{\text{TP}+\text{FN}}$$

F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is especially useful when there is an uneven class distribution. It is calculated as:

$$F1\_Score=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$$

Specificity (True Negative Rate): Specificity measures the proportion of true negative predictions (correctly predicted negative instances) over the total number of actual negative instances in the dataset. It is calculated as:

$$Specificity=\frac{\text{TN}}{\text{TN}+\text{FP}}$$
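As a small sketch, the count-based metrics above (accuracy, precision, recall, F1 score, and specificity) can all be derived from the four confusion-matrix entries. The function name and the example counts are hypothetical, and the sketch assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive count-based metrics from confusion-matrix entries.

    Assumes every denominator below is non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }

# Hypothetical counts for a binary classifier evaluated on 100 samples.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```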

Precision-Recall Curve: The precision-recall curve is a graphical representation of the tradeoff between precision and recall for different classification thresholds. It plots the precision on the y-axis against the recall on the x-axis as the threshold varies.

Area Under the Receiver Operating Characteristic Curve (AUC-ROC): AUC-ROC is a metric that evaluates the performance of a binary classification model across various discrimination thresholds. It represents the area under the ROC curve, where ROC stands for the Receiver Operating Characteristic.

Computation: The AUC-ROC is typically computed by sweeping the classification threshold and calculating the True Positive Rate (TPR) and the False Positive Rate (FPR), i.e., $\text{FP}/(\text{FP}+\text{TN})$, at each threshold. The AUC-ROC is then obtained by plotting TPR against FPR and calculating the area under the curve.
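The threshold sweep described above can be sketched as follows. This is an illustrative, quadratic-time version written for readability; the function name is our own, the area is approximated with the trapezoidal rule, and the sketch assumes binary labels in {0, 1} with at least one positive and one negative sample.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via a threshold sweep and the trapezoidal rule.

    Assumes labels are 0/1 and that both classes are present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # (FPR, TPR) when nothing is predicted positive
    # Each distinct score is used as a threshold, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    # Integrate TPR over FPR with trapezoids between consecutive points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical scores for four samples.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```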

Object Detection Task Metric

Bounding Box: In object detection, algorithms typically predict bounding boxes and class labels for objects in an image. A bounding box is represented by a set of four coordinates: $(x_{min},y_{min},x_{max},y_{max})$, which define the top-left and bottom-right corners of the box.

Intersection Over Union (IoU): The IoU measures the overlap between the predicted bounding box $P$ and the ground truth bounding box $G$. It is defined as:

$$IoU(P,G)=\frac{Area(P\cap G)}{Area(P\cup G)}$$
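For axis-aligned boxes in the $(x_{min},y_{min},x_{max},y_{max})$ format, the IoU can be sketched as below. The function name is hypothetical, and the sketch treats coordinates as continuous values (no "+1" pixel convention).

```python
def iou(box_p, box_g):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle (possibly empty).
    ix_min = max(box_p[0], box_g[0])
    iy_min = max(box_p[1], box_g[1])
    ix_max = min(box_p[2], box_g[2])
    iy_max = min(box_p[3], box_g[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```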

True Positive (TP), False Positive (FP), and False Negative (FN):
- A detection is considered a TP if the IoU with the ground truth exceeds a given threshold (typically $0.5$) and the class label matches.
- A detection is an FP if the IoU is below this threshold, or if there is no corresponding ground truth.
- An FN represents a ground truth box which had no detected box surpassing the IoU threshold.

Precision:

$$Precision=\frac{\text{TP}}{\text{TP}+\text{FP}}$$

Recall:

$$Recall=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
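The TP/FP/FN rules above, together with the resulting precision and recall, can be sketched with a simple greedy matcher. The matching strategy (highest-confidence detections first, each ground truth matched at most once) is a common convention that we assume here rather than something prescribed above; all names and boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(detections, ground_truths, iou_threshold=0.5):
    """Count TP, FP, FN.

    detections: list of (box, label, score); ground_truths: list of (box, label).
    Assumption: greedy matching in descending score order, one match per ground truth.
    """
    matched = set()
    tp = fp = 0
    for box, label, _score in sorted(detections, key=lambda d: d[2], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_box, gt_label) in enumerate(ground_truths):
            if idx in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou > iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn

ground_truths = [((0, 0, 10, 10), "car"), ((20, 20, 30, 30), "person")]
detections = [
    ((1, 1, 10, 10), "car", 0.9),    # overlaps the car well: TP
    ((50, 50, 60, 60), "car", 0.8),  # no overlapping ground truth: FP
    ((20, 20, 30, 30), "car", 0.7),  # right place, wrong label: FP
]
tp, fp, fn = match_detections(detections, ground_truths)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, precision, recall)
```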

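Mean Average Precision (mAP): The mAP is a widely-used metric in object detection. It computes the average precision (AP) of each class, i.e., the precision averaged over different recall levels, and then takes the mean of the AP values across all classes.

Precision-Recall Curve for Object Detection: This curve plots precision against recall as the detection confidence threshold varies, usually at a fixed IoU threshold. Comparing the curves obtained at different IoU thresholds offers insights into a detection model’s performance.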

Average Recall (AR): AR averages the recall values obtained at various IoU thresholds.

Image Segmentation Metrics

Pixel Accuracy: Pixel accuracy is a simple metric that measures the proportion of pixels that are correctly classified. For a given image or set of images, it is defined as the ratio of correctly classified pixels to the total number of pixels.

$$PixelAccuracy=\frac{\text{Number of correctly classified pixels}}{\text{Total number of pixels}}$$
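A minimal sketch of pixel accuracy on label maps stored as nested lists is given below. The function name and the toy maps are hypothetical, and both maps are assumed to have identical shapes.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class equals the ground-truth class.

    pred and target are 2D label maps (lists of rows) of identical shape.
    """
    correct = 0
    total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            correct += int(p == t)
            total += 1
    return correct / total

# Hypothetical 2x3 segmentation maps with class ids 0, 1, 2.
pred = [[0, 1, 1],
        [2, 2, 0]]
target = [[0, 1, 2],
          [2, 2, 2]]
print(pixel_accuracy(pred, target))  # 4 of 6 pixels agree
```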

Boundary F1 Score (BF): The Boundary F1 Score evaluates the accuracy of the boundaries in a segmentation task. Given predicted boundaries $P$ and ground truth boundaries $G$, the BF score is the F1 score (harmonic mean of precision and recall) calculated based on the detected boundary pixels.

$$Precision=\frac{\text{Number of true positive boundary pixels}}{\text{Number of true positive boundary pixels}+\text{Number of false positive boundary pixels}}$$

$$Recall=\frac{\text{Number of true positive boundary pixels}}{\text{Number of true positive boundary pixels}+\text{Number of false negative boundary pixels}}$$

$$BF=\frac{2\times Precision\times Recall}{Precision+Recall}$$

Panoptic Quality (PQ): The Panoptic Quality metric combines segmentation (things and stuff) and detection (things only) into a single score. It is defined as:

$$PQ=\frac{\sum_{i}\text{IoU}_{i}}{N_{\text{matched regions}}+\frac{1}{2}\times N_{\text{false positive regions}}+\frac{1}{2}\times N_{\text{false negative regions}}}$$

Where $\text{IoU}_{i}$ is the intersection over union between the predicted and ground truth segments of matched region $i$, and the sum runs over all matched regions. $N_{\text{matched regions}}$ is the number of matched regions, $N_{\text{false positive regions}}$ is the number of false positive regions, and $N_{\text{false negative regions}}$ is the number of false negative regions.

Image Generation Metrics

Peak Signal-to-Noise Ratio (PSNR): PSNR is a traditional quality metric used to measure the quality of a reconstructed image compared to an original image. Higher values of PSNR indicate better quality. It is defined as:

$$PSNR=10\times\log_{10}\left(\frac{MAX_{I}^{2}}{MSE}\right)$$

Where $MAX_{I}$ is the maximum possible pixel value of the image (often $255$ for an 8-bit image), and $MSE$ is the Mean Squared Error between the original and the reconstructed image.
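As a sketch, PSNR can be computed directly from the definition. The function name and pixel values are hypothetical; images are represented as nested lists of grayscale intensities, and identical images are mapped to infinity because the MSE is zero.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized grayscale images."""
    squared_errors = [
        (o - r) ** 2
        for o_row, r_row in zip(original, reconstructed)
        for o, r in zip(o_row, r_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 2x2 8-bit image and a reconstruction that is off by 5 everywhere.
original = [[0, 255], [128, 64]]
reconstructed = [[5, 250], [133, 59]]
print(psnr(original, reconstructed))
```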

Structural Similarity Index Measure (SSIM): SSIM measures the structural similarity between two images. It provides a more perceptual-based assessment of image quality than PSNR. A value of 1 indicates the images are identical in terms of structural information.

$$SSIM(x,y)=\frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})}$$

Where $x$ and $y$ are two images, $\mu_{x}$ and $\mu_{y}$ are their means, $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are their variances, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $C_{1}$ and $C_{2}$ are constants that avoid instability when the denominator is close to zero.
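The SSIM formula can be illustrated with a simplified, single-window sketch that uses global image statistics. Note the assumptions: practical implementations usually average SSIM over local (often Gaussian-weighted) windows, and the constants $C_{1}=(0.01L)^{2}$ and $C_{2}=(0.03L)^{2}$, with $L$ the dynamic range, are commonly used defaults that we adopt here; the function name is hypothetical.

```python
def ssim_global(x, y, max_value=255.0):
    """SSIM computed from global statistics of two flattened images.

    Simplification: one window covering the whole image, population statistics.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_value) ** 2  # assumed default constant
    c2 = (0.03 * max_value) ** 2  # assumed default constant
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Hypothetical flattened images: identical inputs give a score of 1.
image = [10.0, 20.0, 30.0, 40.0]
print(ssim_global(image, image))
print(ssim_global(image, image[::-1]))  # reversed structure scores lower
```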

Inception Score (IS): The Inception Score is used to evaluate the quality and diversity of generated images in GANs. A higher IS indicates both better image quality and greater diversity. It’s calculated using a pre-trained Inception model.

$$IS=\exp\left(E_{x}[\text{KL}(p(y|x)||p(y))]\right)$$

Where $x$ is an image, $y$ is the label predicted by the Inception model, and $KL$ is the Kullback-Leibler divergence.
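Given the class posteriors $p(y|x)$ for a set of generated images, the Inception Score can be sketched as follows. The sketch takes the probabilities as input and does not run an Inception network; the function name and the probability tables are hypothetical, and the marginal $p(y)$ is estimated as the average of the posteriors.

```python
import math

def inception_score(cond_probs):
    """Inception Score from a list of per-image class distributions p(y|x)."""
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Marginal p(y): average of the conditional distributions over all images.
    marginal = [sum(row[j] for row in cond_probs) / n for j in range(k)]
    kl_total = 0.0
    for row in cond_probs:
        # KL(p(y|x) || p(y)); terms with p(y|x) = 0 contribute nothing.
        kl_total += sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0)
    return math.exp(kl_total / n)

# Confident and diverse predictions reach the maximum (number of classes).
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
# Uninformative predictions give the minimum score of 1.
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
```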

Fréchet Inception Distance (FID): FID measures the similarity between the generated images and real images. It computes the Fréchet distance between two Gaussians fitted to the feature representations of the Inception network for both sets of images. Lower FID scores indicate that the two sets of images are more similar, implying better generation quality.

$$FID=||\mu_{1}-\mu_{2}||^{2}+\text{Tr}(\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}\Sigma_{2})^{0.5})$$

Where $\mu_{1},\Sigma_{1}$ are the mean and covariance of the feature representations for real images and $\mu_{2},\Sigma_{2}$ are those for generated images.
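To make the two terms of the FID concrete, the sketch below evaluates the formula in the special case of diagonal covariance matrices, where the matrix square root reduces to element-wise square roots. This restriction is our simplifying assumption: the general case requires a full matrix square root (for example via `scipy.linalg.sqrtm`), and the feature statistics would come from an Inception network rather than the toy values used here.

```python
import math

def fid_diagonal(mu1, var1, mu2, var2):
    """FID between two Gaussians with diagonal covariances.

    mu1, mu2: mean vectors; var1, var2: per-dimension variances (the diagonals).
    """
    # Squared Euclidean distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # Trace term; for diagonal matrices (S1 S2)^0.5 has entries sqrt(s1 * s2).
    trace_term = sum(s1 + s2 - 2 * math.sqrt(s1 * s2) for s1, s2 in zip(var1, var2))
    return mean_term + trace_term

# Hypothetical 2-dimensional feature statistics.
print(fid_diagonal([0.0, 0.0], [1.0, 4.0], [1.0, 2.0], [4.0, 1.0]))  # 7.0
print(fid_diagonal([0.0, 0.0], [1.0, 4.0], [0.0, 0.0], [1.0, 4.0]))  # 0.0
```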

Text Generation Metrics

BLEU (Bilingual Evaluation Understudy): BLEU is a metric originally designed for machine translation but is also used in text generation. It measures how many n-grams in the generated text match the n-grams in the reference text(s). The score ranges between 0 and 1, with 1 being a perfect match.

$$BLEU=\min\left(1,\frac{\text{length of generated text}}{\text{length of reference text}}\right)\times\exp\left(\sum_{n=1}^{N}w_{n}\log p_{n}\right)$$

Where $w_{n}$ are the weights for each n-gram (typically $w_{n}=\frac{1}{N}$), $p_{n}$ is the precision of n-grams, and $N$ is the maximum n-gram order.
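The BLEU formula above can be sketched for a single tokenized reference as follows. The sketch follows the length-ratio penalty as written in the formula and uses clipped n-gram counts for $p_{n}$ with uniform weights $w_{n}=1/N$; it applies no smoothing, so any order without a match yields a score of 0. Function names and sentences are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_sum += (1.0 / max_n) * math.log(overlap / total)
    length_penalty = min(1.0, len(candidate) / len(reference))
    return length_penalty * math.exp(log_sum)

reference = "the cat sat on the mat".split()
print(bleu(reference, reference))                     # 1.0
print(bleu("the cat sat".split(), reference, max_n=2))  # 0.5
```

In the second call every unigram and bigram of the candidate appears in the reference, so the score is determined solely by the length ratio of 3/6.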

ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Primarily used for evaluating summary generation, ROUGE measures the overlap between the n-grams in the generated text and the reference text(s).

ROUGEN=sreference summariesn-gramsCountmatch(n-gram)sreference summariesn-gramsCount(n-gram)𝑅𝑂𝑈𝐺𝐸𝑁subscript𝑠reference summariessubscriptn-gram𝑠subscriptCountmatchn-gramsubscript𝑠reference summariessubscriptn-gram𝑠Countn-gramROUGE-N=\frac{\sum_{s\in\text{reference summaries}}\sum_{\text{n-gram}\in s}% \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{s\in\text{reference % summaries}}\sum_{\text{n-gram}\in s}\text{Count}(\text{n-gram})}italic_R italic_O italic_U italic_G italic_E - italic_N = divide start_ARG ∑ start_POSTSUBSCRIPT italic_s ∈ reference summaries end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT n-gram ∈ italic_s end_POSTSUBSCRIPT Count start_POSTSUBSCRIPT match end_POSTSUBSCRIPT ( n-gram ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_s ∈ reference summaries end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT n-gram ∈ italic_s end_POSTSUBSCRIPT Count ( n-gram ) end_ARG

Where $\text{Count}_{\text{match}}$ is the number of matching n-grams between the generated text and the reference summary, and $\text{Count}$ is the number of n-grams in the reference summary.
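The recall-oriented counting in ROUGE-N can be sketched as follows. This is a minimal sketch assuming pre-tokenized inputs and assuming that a match is counted with clipping, i.e. a reference n-gram is matched at most as often as it appears in the generated text. The names `ngram_counts` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n=2):
    """ROUGE-N recall of one generated text against a list of reference summaries."""
    cand_counts = ngram_counts(candidate, n)
    matched = 0
    total = 0
    for reference in references:
        ref_counts = ngram_counts(reference, n)
        # Count_match: reference n-grams that also occur in the generated text (clipped).
        matched += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        # Count: all n-grams in the reference summary.
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

With the generated text `"the cat sat on the mat"` and the single reference `"the cat lay on the mat"`, three of the five reference bigrams are matched, so ROUGE-2 is 0.6.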

Perplexity: Used for evaluating language models, perplexity measures how well the probability distribution predicted by the model aligns with the true distribution of the words in the text. Lower perplexity values indicate better model performance.

$$\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_{i})\right)$$

Where $N$ is the total number of words, and $p(w_{i})$ is the model's predicted probability for word $w_{i}$.
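A minimal sketch of this computation, assuming the per-word probabilities $p(w_i)$ have already been obtained from a language model and are all strictly positive (the function name `perplexity` is hypothetical):

```python
import math


def perplexity(word_probs):
    """Perplexity from the model's predicted probability of each observed word.

    Assumes a non-empty list of strictly positive probabilities.
    """
    n = len(word_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in word_probs) / n
    return math.exp(avg_neg_log_prob)
```

As a sanity check, a model that assigns probability 0.25 to every word has perplexity 4, matching the intuition that it is as uncertain as a uniform choice among four words.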

Self-BLEU: A metric that evaluates the diversity of generated texts. It computes the BLEU score between each generated text and all other generated texts. Lower Self-BLEU scores indicate higher diversity.

Distinct-N: Measures the diversity of generated content by computing the ratio of unique n-grams to the total number of generated n-grams. Higher values of Distinct-N indicate greater diversity.

$$\text{Distinct-N} = \frac{\text{Number of unique n-grams}}{\text{Total number of generated n-grams}}$$
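The ratio can be sketched as below. This assumes the n-grams are pooled over all generated texts before counting (a corpus-level variant; a per-text average is another common choice), and the name `distinct_n` is hypothetical.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to all n-grams, pooled over tokenized generated texts."""
    all_ngrams = []
    for tokens in texts:
        all_ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)
```

For the single text `["a", "b", "a", "b"]`, there are two unique unigrams out of four, so Distinct-1 is 0.5, and two unique bigrams out of three, so Distinct-2 is 2/3.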

4.3.2. Multi-task Metric

In this section, we denote by $M^{t}_{MTL}$ and $M^{t}_{STL}$ the STL measurements of the MTL method and the STL baseline on the $t$-th task, respectively. $M^{t}_{STL}\downarrow$ indicates that a lower value of the measurement $M^{t}_{STL}$ means better performance, and vice versa.

Delta (dong2015multi)

The performance of an MTL method can be defined simply as the difference in the STL measurement between the MTL method and the STL baseline:

$$\text{Delta} = M^{t}_{MTL} - M^{t}_{STL}, \quad t = 1, \cdots, T, \tag{122}$$

where $M$ was set to BLEU-4 (papineni2002bleu) in dong2015multi.

MTL gain (tang2020progressive)

To evaluate the benefit of an MTL method over the STL baseline on the $t$-th task, MTL gain is computed as follows:

$$\text{MTL gain} = (-1)^{\mathds{1}\{M^{t}_{STL}\downarrow\}} \left(M^{t}_{MTL} - M^{t}_{STL}\right), \quad t = 1, \cdots, T, \tag{123}$$

which, unlike Delta (dong2015multi), is positive whenever the MTL method improves on the STL baseline, regardless of whether a higher or a lower value of the measurement is better.
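The difference between Delta and MTL gain is only the sign flip for lower-is-better measurements, which a short sketch makes concrete. The function names and the boolean flag `lower_is_better` (standing in for the indicator $\mathds{1}\{M^{t}_{STL}\downarrow\}$) are hypothetical.

```python
def delta(m_mtl, m_stl):
    """Delta for one task: raw difference between the MTL and STL measurements."""
    return m_mtl - m_stl


def mtl_gain(m_mtl, m_stl, lower_is_better):
    """MTL gain for one task: the difference, sign-flipped when lower is better."""
    sign = -1.0 if lower_is_better else 1.0
    return sign * (m_mtl - m_stl)
```

For an error-type measurement that falls from 0.35 (STL) to 0.30 (MTL), `delta` returns about -0.05 while `mtl_gain(0.30, 0.35, lower_is_better=True)` returns about +0.05, so the gain reads as an improvement in both the higher-is-better and lower-is-better cases.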

$\Delta_{m}$ (maninis2019attentive)

The performance of an MTL method can be quantified by calculating the average per-task drop with respect to the single-task baseline using STL measurements:

$$\Delta_{m} = \frac{1}{T}\sum\nolimits_{t=1}^{T} (-1)^{\mathds{1}\{M^{t}_{Baseline}\downarrow\}} \left(M^{t}_{MTL} - M^{t}_{Baseline}\right) / M^{t}_{Baseline}. \tag{124}$$

$\Delta_{p}$ (lin2022reasonable)

Given that many single tasks can be measured by several metrics, e.g., semantic segmentation measured by mIoU and pixel accuracy (pixacc), $\Delta_{p}$ follows $\Delta_{m}$ (maninis2019attentive) and measures MTL performance as the average relative improvement of the MTL method over the baseline on each metric of each task:

$$\Delta_{p} = \frac{1}{T}\sum\nolimits_{t=1}^{T} \frac{1}{M_{t}} \sum\nolimits_{m=1}^{M_{t}} (-1)^{\mathds{1}\{M^{t,m}_{Baseline}\downarrow\}} \left(M^{t,m}_{MTL} - M^{t,m}_{Baseline}\right) / M^{t,m}_{Baseline}, \tag{125}$$

where $M_{t}$ is the number of metrics used for the $t$-th task. $M^{t,m}_{Baseline}$ denotes the $m$-th performance measurement of the baseline method, e.g., the STL or vanilla MTL method, for the $t$-th task.
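Both relative measures can be sketched together, since $\Delta_{p}$ is a per-task average of the same relative terms that $\Delta_{m}$ averages over tasks. The sketch below is illustrative only and assumes the sign convention in which a lower-is-better measurement is sign-flipped, so that positive values indicate an improvement over the baseline. It also assumes non-zero baseline values. The function names and the `lower_is_better` flags are hypothetical.

```python
def delta_m(mtl, baseline, lower_is_better):
    """Average relative change over tasks, one measurement per task.

    mtl, baseline: lists of measurements, one entry per task.
    lower_is_better: list of booleans, True where a lower value is better.
    """
    terms = []
    for m, b, low in zip(mtl, baseline, lower_is_better):
        sign = -1.0 if low else 1.0
        terms.append(sign * (m - b) / b)
    return sum(terms) / len(terms)


def delta_p(mtl, baseline, lower_is_better):
    """Average over tasks of the per-task average relative change over metrics.

    Each argument is a list over tasks of lists over that task's metrics.
    """
    per_task = [
        delta_m(task_mtl, task_base, task_flags)
        for task_mtl, task_base, task_flags in zip(mtl, baseline, lower_is_better)
    ]
    return sum(per_task) / len(per_task)
```

For instance, with a segmentation task scored by two higher-is-better metrics (0.5 vs. 0.4 and 0.9 vs. 0.8) and a second task scored by one lower-is-better error (2.0 vs. 2.5), the per-task averages are 0.1875 and 0.2, so `delta_p` returns 0.19375.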

5. Discussion

In this section, we will discuss several key questions and explore future directions concerning the theories and applications of MTL.

Multi-Task Pretraining. While MTL has demonstrated remarkable success in real-world scenarios, understanding its underlying mechanisms becomes even more important in the era of PFMs. When scalable foundation models are pre-trained on data in the wild to exhibit modality- and task-agnostic characteristics (§ 2.3), an essential question arises: What proportions of different tasks in the pretraining phase yield the best task-generalizable performance?

Competitive or Collaborative? While many proposed MTL methods benefit each task under their specific settings, competitive tasks continue to exist in real-world scenarios. Distinguishing competitive tasks from collaborative ones without human priors before employing MTL remains a challenge. Task prior sharing (§ 2.1.5) and task clustering methods (§ 2.1.6) can play a crucial role, as they help uncover task relations and do not conflict with other multi-task representation learning methods.

Blessed or Cursed by a Large Number of Tasks? While MTL with a small number of tasks has been shown to outperform STL, and MTL with a large number of tasks has been demonstrated to be learnable, the relationship between model behavior and the number of tasks raises intriguing questions. Introducing a new task typically brings both knowledge and noise to the existing tasks. If all tasks are trained equally (e.g., in LLMs) without any selective mechanism, what is the outcome of the final learned model for each individual task?

MTL for Other Things. The pursuit of performance through MTL has been shown to have potential drawbacks in terms of fairness (§ 3.1), security and privacy (§ 3.2). However, MTL can also contribute to learning fairness or enhancing security and privacy for involved tasks by incorporating novel metrics. In certain situations, a favorable trade-off between these considerations may exist.

Illuminating the Unseen with MTL. To underscore the insights MTL can provide, consider an example in which MTL results advanced the understanding of a complex problem. In a medical imaging scenario, MTL was applied to simultaneously predict multiple health-related outcomes, such as disease progression, severity, and patient response to treatment. Unlike STL approaches, MTL exposed dependencies and interactions between these outcomes, showing that certain imaging features influenced several health aspects at once. This holistic perspective allowed researchers to identify subtle correlations and patterns that were obscured by analyses centered on individual tasks. In this case, MTL not only improved predictive accuracy but also provided a richer understanding of the medical conditions under investigation. This case illustrates how MTL can reveal relationships between tasks and enhance interpretability beyond the capabilities of traditional STL methods.

6. Conclusion

In this survey, we introduce MTL from coarse to fine and review methodologies spanning the traditional ML, DL, and PFM eras. First, we present the background of MTL, covering its scope, formal definition, comparisons with other paradigms, and motivations. After that, we explore why MTL works and explain its intrinsic mechanisms. We formalize and illustrate MTL in a framework and expand the methodology overview based on this framework. Specifically, we summarize the sparse structure learning, feature learning, low-rank learning, and decomposition methods of the traditional learning era. We categorize MTL in DL into feature sharing, task balancing, and neural architecture search methods; recent task- and modality-agnostic foundation models are also discussed, as they can learn universal comprehension across tasks with different data modalities.

In summary, MTL methods in the traditional learning era prefer to "drop" distinctive (task-specific) features to seek consensus. For instance, the classical $\ell_{2,1}$ norm realizes grouped feature selection across tasks to exploit common features that are effective and efficient for joint performance enhancement. Another example is low-rank learning, which explores common underlying representations by imposing low-dimensional structure on the essential factors, under the assumption that a small set of factors governs multiple tasks. In DL models, by contrast, powerful computational resources make it possible to handle all the features from different tasks, and the hierarchical, multi-layer structure can learn feature interactions across tasks at various levels of abstraction. Accordingly, over the past decade MTL has been dominated by feature fusing and task-balancing techniques that introduce learnable parameters. These learnable parameters play a crucial role in cross-task communication and eavesdropping during joint training. However, the mechanisms behind these complicated interactions inside the networks remain poorly understood. More recently, unified foundation models have shown promising results for MTL in real-world scenarios, as data of diverse modalities can be trained on simultaneously to learn universal and effective comprehension.

Overall, we hope this paper provides the research community with an extensive review that supports a comprehensive understanding of the research advances, current and future challenges, and opportunities for MTL.

Disclosure Statement

The authors have no conflicts of interest to declare.

Acknowledgments

This paper is the result of a collaborative effort, with each author contributing significantly to various aspects:

  • Yutong Dai orchestrated two critical optimizations in MTL, detailed in § 2.2.5 and § 2.2.6.

  • Xiaokang Liu contributed by writing and organizing the section on MTL via low-rank factorization (§ 2.1.3).

  • Jin Huang was responsible for the figure and layout designs, ensuring visual clarity and coherence.

  • Yishan Shen focused on developing the MTL through prior sharing, as outlined in § 2.1.5.

  • Ke Zhang was instrumental in writing and structuring the Graph-based MTL section (§ 2.2.9).

  • Rong Zhou authored the STL metrics section and played a key role in organizing parts of the datasets.

  • Eashan Aahikarla delved deeply into the distribution shifts that occur in MTL (§ 3.3).

  • Wenxuan Ye took charge of organizing the GitHub website for this project, facilitating broader access and collaboration.

  • Yixin Liu was pivotal in developing the security and privacy section for the MTL framework, as detailed in § 3.2.

  • Zhaoming Kong and Kai Zhang were actively involved in discussions about the scope and structure of this survey.

  • Jun Yu initiated this project in 2021 and managed the contents not specifically mentioned above, providing overall leadership and direction.

  • Prof. Moore, Prof. Davison, Prof. Namboodiri and Prof. Yin contributed significantly by offering feedback and suggestions during the paper’s development.

  • Prof. Chen finalized the paper structure, edited different versions of the manuscript, and tailored the materials to the audience of the research community.

All authors above actively participated in the proofreading and discussion stages of this paper. We extend our sincere gratitude to all for their valuable contributions and collective effort in bringing this research to this final version.
