Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras

Abstract.

Multi-Task Learning (MTL) is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to Single-Task Learning (STL), MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL’s key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including computer vision, natural language processing, recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and task-agnostic training, along with the capacity for zero-shot learning, which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner.
This project is publicly available at https://github.com/junfish/Awesome-Multitask-Learning.

Jun Yu$^{1,2,\dagger,\ddagger}$, Yutong Dai$^{3}$, Xiaokang Liu$^{2,4}$, Jin Huang$^{5}$, Yishan Shen$^{2}$, Ke Zhang$^{6}$, Rong Zhou$^{1}$, Eashan Adhikarla$^{1}$, Wenxuan Ye$^{1}$, Yixin Liu$^{1}$, Zhaoming Kong$^{7}$, Kai Zhang$^{1}$, Yilong Yin$^{5}$, Vinod Namboodiri$^{1,8}$, Brian D. Davison$^{1}$, Jason H. Moore$^{9}$, Yong Chen$^{2,\ddagger}$

$^{1}$ Department of Computer Science and Engineering, Lehigh University, USA

$^{2}$ Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, USA

$^{3}$ Department of Industrial and Systems Engineering, Lehigh University, USA

$^{4}$ Department of Statistics, University of Missouri, USA

$^{5}$ School of Software, Shandong University, China

$^{6}$ Department of Computer Science, University of Hong Kong, China

$^{7}$ Department of Computer Science and Engineering, South China University of Technology, China

$^{8}$ Department of Community and Population Health, Lehigh University, USA

$^{9}$ Department of Computational Biomedicine, Cedars-Sinai Medical Center, USA

$^{\dagger}$ This work includes efforts as a visiting student at UPenn.

$^{\ddagger}$ Correspondence to [email protected] or [email protected].

Figure 1. Significant landmarks in the evolution of Multi-Task Learning (MTL) highlighted over time.

Keywords: Deep Learning, Generative Pretrained Transformers, Multi-Objective Optimization, Multi-Task Learning, Pretrained Foundation Models, Prompt Learning


1. Introduction

In the introduction, we hope to answer the following five research questions (RQs) before we overview the methodologies of Multi-task Learning (MTL):

• RQ1: What is the concept and definition of MTL? (See § 1.1)
• RQ2: How does MTL distinguish itself from other learning paradigms? (See § 1.2)
• RQ3: What motivates the use of MTL in learning scenarios? (See § 1.3)
• RQ4: What underlying principles does the efficacy of MTL rest on? (See § 1.4)
• RQ5: In what ways does our survey differ from previous studies? (See § 1.5)

In § 1.1, we progressively introduce Multi-Task Learning (MTL), starting with a broad sense and culminating in a formal definition. Subsequently, § 1.2 explores the position of MTL within the Machine Learning (ML) landscape, drawing comparisons with related paradigms such as Transfer Learning (TL), Few-Shot Learning (FSL), lifelong learning, Multi-View Learning (MVL), to name a few. § 1.3 delves into the motivations for employing MTL, offering insights from both explicit and subtle angles, while also addressing how MTL benefits the involved tasks. In § 1.4, we delve deeper into the fundamental mechanisms and theories underpinning MTL, specifically: 1) regularization, 2) inductive bias, and 3) feature sharing, providing an understanding of its underlying principles. Finally, § 1.5 reviews existing surveys on MTL, underscoring the unique contributions of our survey and laying out a structured roadmap for the remainder of this work. The structure of our survey is depicted in Fig. 2. Before delving into this survey, readers can quickly refer to Table 1 for a list of acronyms not related to datasets, institutions, and newly proposed methods, while an overview of mathematical notations is provided in Table 3 and Table 6.

Table 1. Alphabetically sorted index table of acronyms.

Abbreviation	Expanded Form	Abbreviation	Expanded Form
AD	Alzheimer’s Disease	AGM	Accelerated Gradient Method
APM	Accelerated Proximal Method	CE	Cross-Entropy
CNN	Convolutional Neural Network	CT	Computed Tomography
CV	Computer Vision	DA	Domain Adaptation
DL	Deep Learning	DNN	Deep Neural Network
FCN	Fully Convolutional Network	FNN	Feedforward Neural Network
FSL	Few-Shot Learning	GAN	Generative Adversarial Network
GCN	Graph Convolutional Network	GNN	Graph Neural Network
GP	Gaussian Process	GPT	Generative Pretrained Transformer
GPU	Graphics Processing Unit	GRL	Gradient Reversal Layer
I/O	Input/Output	KD	Knowledge Distillation
LLM	Large Language Model	LSTM	Long Short-Term Memory
MAP	Maximum A Posteriori	MCI	Mild Cognitive Impairment
MDP	Markov Decision Process	MIM	Masked Image Modeling
MIML	Multi-Instance Multi-Label learning	MIMO	Multi-Input Multi-Output
MISO	Multi-Input Single-Output	ML	Machine Learning
MLM	Masked Language Modeling	MLP	Multi-Layer Perceptron
MoE	Mixture-of-Experts	MOO	Multi-Objective Optimization
MRI	Magnetic Resonance Imaging	MSE	Mean Squared Error
MTL	Multi-Task Learning	MTRL	Multi-Task Reinforcement Learning
MVL	Multi-View Learning	NAS	Neural Architecture Search
NLI	Natural Language Inference	NLP	Natural Language Processing
OCR	Optical Character Recognition	OOD	Out-Of-Distribution
PET	Positron Emission Tomography	PFM	Pretrained Foundation Model
PSD	Positive Semi-Definite	RL	Reinforcement Learning
RNN	Recurrent Neural Network	seq2seq	sequence to sequence
SIMO	Single-Input Multi-Output	SNP	Single Nucleotide Polymorphism
SGD	Stochastic Gradient Descent	SSL	Self-Supervised Learning
SOTA	State-Of-The-Art	STL	Single-Task Learning
SVD	Singular Value Decomposition	SVM	Support Vector Machine
TL	Transfer Learning	TPU	Tensor Processing Unit
VLM	Vision-Language Model	VQA	Visual Question Answering
ZSL	Zero-Shot Learning

This table excludes abbreviations pertaining to datasets, institutions, and newly proposed methods.

1.1. Definition

The increasing popularity of MTL over the past few decades is evident in Fig. 3, which displays the trend in the number of papers associated with “allintitle: ‘multitask learning’ OR ‘multi-task learning’ ” as a keyword search, according to data from Google Scholar (https://scholar.google.com).

As the name suggests, MTL is a subfield of ML where multiple tasks are jointly learned. In this manner, we hope to leverage useful information across these related tasks and break from the tradition of performing different tasks in isolation. In Single-Task Learning (STL), data specific to the task at hand is the only source for training a learner. In contrast, MTL can conveniently transfer extra knowledge learned from other tasks. The essence of MTL is to exploit consensual and complementary information among tasks by combining data resources and sharing knowledge. This points to a better learning paradigm that can reduce memory burden and data consumption while improving training speed and testing performance. For instance, jointly learning monocular depth estimation (estimating the distance to the camera) (eigen2014depth) and semantic segmentation (assigning a class label to every pixel) (fu1981survey) in images is beneficial since both tasks need to perceive meaningful objects. MTL has become increasingly ubiquitous as experimental and theoretical analyses continue to validate its promising results. For example, using Face ID to unlock an iPhone is a typical but imperceptible MTL application that involves simultaneously locating the user’s face and identifying the user. In general, multitasking occurs whenever we attempt to handle two or more objectives during the optimization stage.

Consequently, MTL exists everywhere in ML, even when performing STL with regularization. This can be understood as having one target task and an additional artificial task of human preference, such as learning a constrained model via $\ell_{2}$ regularizer or a parsimonious model via $\ell_{1}$ regularizer. These hypothesis preferences can serve as an inductive bias to enhance an inductive learner (caruna1993multitask). In the early exploration of MTL (caruana1997multitask), the extra information that the involved tasks provide is regarded as a domain-specific inductive bias for the other tasks. Since collecting training signals from other tasks is more practical than acquiring inductive bias from model design or human expertise, we can thus empower any ML models via this MTL paradigm.

1.1.1. Formal Definition

To understand MTL comprehensively, we provide a formal definition. Suppose we have a sample dataset $\boldsymbol{X}$ drawn from the feature space ${\mathcal{X}}$, and its respective ground-truth label set $\boldsymbol{Y}$ drawn from the label space ${\mathcal{Y}}$. We can define experience ${\mathcal{E}}\subseteq\{\boldsymbol{X},\boldsymbol{Y}\}$, domain ${\mathcal{D}}=({\mathcal{X}},P(\boldsymbol{X}))$, and task ${\mathcal{T}}=({\mathcal{Y}},f)$, where $P(\boldsymbol{X})$ is the distribution of $\boldsymbol{X}$ and $f$ maps a data sample $\boldsymbol{x}\in\boldsymbol{X}$ to a prediction $\tilde{\boldsymbol{y}}\in\boldsymbol{Y}$. These predictive values make up the predictive label set $\tilde{\boldsymbol{Y}}=\{\tilde{\boldsymbol{y}}|\tilde{\boldsymbol{y}}=f(\boldsymbol{x}),\boldsymbol{x}\in\boldsymbol{X}\}$. Following the ML settings, we should define a measurement ${\mathcal{P}}=(\boldsymbol{Y},\tilde{\boldsymbol{Y}},{\mathcal{L}})$, where ${\mathcal{L}}$ is a function to measure the distance between any pair $(\boldsymbol{y},\tilde{\boldsymbol{y}})$. For more basic notations, please refer to Table 3. Based on the definitions of the four basic elements above (experience, domain, task, and measurement), we first restate the general definition of machine learning by mitchell1997machine in a more exact form as follows.

Definition 1 (Machine Learning, mitchell1997machine).

A computer program is said to learn from experience ${\mathcal{E}}$ with respect to a set of tasks $\{{\mathcal{T}}^{(t)}\}_{t=1}^{T}$ and performance measurement ${\mathcal{P}}$ , if its performance at tasks $\{{\mathcal{T}}^{(t)}\}_{t=1}^{T}$ , as measured by ${\mathcal{P}}$ , improves with experience ${\mathcal{E}}$ .

The definition above inherently covers both single-task and multi-task scenarios in the ML process, but it is not precise enough to characterize MTL in a way that includes recent developments. We therefore first define STL and then use it to derive the formal definition of MTL.

Definition 2 (Single-Task Learning).

A type of machine learning specified by ${\mathcal{E}},\{{\mathcal{T}}^{(t)}\}_{t=1}^{T}$ and ${\mathcal{P}}$ , where $\{{\mathcal{T}}^{(t)}\}_{t=1}^{T}$ contains only one task (i.e. $T=1$ ) on a specific domain ${\mathcal{D}}$ .

As recent developments in MTL focus more on heterogeneous tasks (e.g., regression $+$ classification) than homogeneous ones, each task should be represented by its own experience ${\mathcal{E}}$ on its corresponding domain ${\mathcal{D}}$. Due to this diversity, we employ a distinct measurement ${\mathcal{P}}$ to evaluate the learning performance of each task. We accordingly define MTL as follows.

Definition 3 (Multi-Task Learning).

A superset of STL specified by $\bigcup_{t=1}^{T}{\mathcal{E}}^{(t)},\{{\mathcal{T}}^{(t)}\}_{t=1}^{T}$ and $\{{\mathcal{P}}^{(t)}\}_{t=1}^{T}$, where experience ${\mathcal{E}}^{(t)}\subseteq\{\boldsymbol{X}^{(t)},\boldsymbol{Y}^{(t)}\}$ is with respect to task ${\mathcal{T}}^{(t)}$ on its corresponding domain ${\mathcal{D}}^{(t)}$. Accordingly, a computer program is said to perform MTL from the experience set $\bigcup_{t=1}^{T}{\mathcal{E}}^{(t)}$ with respect to the task set $\{{\mathcal{T}}^{(t)}\}_{t=1}^{T}$ and the corresponding performance measurement set $\{{\mathcal{P}}^{(t)}\}_{t=1}^{T}$, if its total performance at any task ${\mathcal{T}}^{(t)}$, as measured by its corresponding ${\mathcal{P}}^{(t)}$, $t=1,\cdots,T$, improves with the experience set $\bigcup_{t=1}^{T}{\mathcal{E}}^{(t)}$.

We note that the formal MTL definition above is compatible with both homogeneous and heterogeneous MTL.

1.2. Related Fields

Having established a formal definition of MTL grounded in fundamental ML elements, a thorough understanding can be achieved by analytically comparing it with related domains. These include Transfer Learning (TL), Meta-Learning, and In-Context Learning (ICL), among others. This comparison not only clarifies the distinct characteristics of MTL but also situates it within the broader context of these interconnected fields.

Transfer Learning (TL)

TL (pan2009survey) is a prevalent learning paradigm that solves the problem of lacking labeled data when applying ML to real-world data (zhuang2020comprehensive; pan2009survey). Specifically, TL improves the performance of a target model on target domains by transferring the knowledge in different but related source domains to the target domains. Such properties make TL well-appreciated in real-world applications, such as healthcare (kao2021toward; song2021transfer; perez2021transfer) and recommender systems (tl_recom_www21; liu2021leveraging; tl_recom_cikm21). According to the availability of labels in the source and target domains, TL is categorized into three types, i.e., transductive TL (aka Domain Adaptation (DA), redko2019advances; patel2015visual), inductive TL, and unsupervised TL (zhuang2020comprehensive; pan2009survey).

Few-Shot Learning (FSL)

FSL (fink2004object; fei2006one; wang2020generalizing) is a specific application case of TL. It aims at obtaining a model for the target task under a certain scenario where limited labeled samples from the target domain are available (wang2020generalizing). FSL is well-acknowledged in tackling different real-world problems such as identifying atypical ailments (quellec2020automatic; jia2020few), visual navigation (al2022zero; luo2021few), and cold-start item recommendation (sun2021mfnp; zhang2021model).

Meta-Learning

Meta-Learning (hospedales2021meta) is an implementation approach to achieve TL. The main concept is to obtain a meta-learner (a model) that performs satisfactorily on an unseen target domain (hospedales2021meta). Such a meta-learner first extracts the meta-knowledge, i.e., the universally applicable principles, across source domains. With this meta-knowledge, the meta-learner can be easily generalized to the target domain by leveraging the target samples. Meta-learning has been successfully applied to various problems such as hyper-parameter optimization (bohdal2021evograd; raghu2021meta), algorithm selection for data mining (simchowitz2021bayesian), and neural architecture search (NAS) (lee2021hardware; ding2022learning).

Though TL paradigms, including FSL and meta-learning, involve multi-domain data, their ultimate goal is to obtain a model that performs satisfactorily on, or can be easily generalized to, one target task. In other words, TL leverages the knowledge in different tasks to assist the model in learning a single task, which intersects with MTL according to Definition 3. Thus, TL can bring merits to MTL, such as capturing the relations among tasks and extracting shared knowledge among the involved tasks. Notably, in recent advancements the transfer of knowledge from pretrained foundation models (PFMs) has proved beneficial for a myriad of downstream tasks (bommasani2021opportunities; zhou2023comprehensive).

Lifelong Learning

Lifelong Learning (parisi2019continual), aka Continual Learning, Sequential Learning, or Incremental Learning, studies the problem of learning from an infinite stream of data (de2021continual). The goal is to gradually extend the acquired knowledge and use it for future data, mitigating the occurrence of catastrophic forgetting or interference (mcclelland1995there). With only a small portion of the input data from one or a few tasks available at once, lifelong learning aims in particular to preserve the knowledge learned from previous inputs when learning on new data, i.e., to address the stability-plasticity dilemma (grossberg2012studies). There are extensive applications of lifelong learning in solving tasks in ever-evolving systems, such as recommendations (chen2021towards; yao2021device) and anomaly detection (peng2021lime; doshi2022rethinking). Lifelong learning differs from MTL in that its training object is a dynamic data stream, while MTL studies data from multiple tasks that are all available at the beginning of the learning process.

Multi-View Learning (MVL)

MVL (xu2013survey; zhao2017multi; li2018survey) studies the problem of jointly learning from multi-view data samples, with the goal of optimizing the generalization performance of the jointly learned model (li2018survey). In real-world applications, multi-view data refers to objects described by multi-modal measurements, such as image+text, audio+video, and audio+articulation. Multi-Instance Multi-Label learning (MIML) (zhou2012multi) is a specific subtype of MVL, where an example is described by multiple instances and associated with multiple class labels. Because multi-view data is so widespread in real-world settings, MVL has attracted much attention in both research and industry, and the respective solutions play essential roles in cross-media retrieval (zhen2019deep; huang2020forward), video analysis (wang2022cascade; zellers2021merlot), recommender systems (wei2022contrastive; chai2022knowledge), etc. MVL, including MIML, can be considered a specialized form of MTL, where the input contains data from multiple domains that are handled as distinct tasks, but the output is still in one label space.

In-Context Learning (ICL)

ICL (dong2022survey) has aroused interest as a novel learning paradigm for natural language processing (NLP) within Large Language Models (LLMs). ICL relies on templates in natural language that can demonstrate different tasks, such as solving mathematical reasoning problems (wei2022chain) and learning natural language inference (NLI) (liu2021natural). LLMs can then make predictions by taking this demonstration and its corresponding query pair as input. While both ICL and MTL involve leveraging shared knowledge or context to enhance task generalizability, ICL is specifically tailored to the target task within a narrower scope in real-world applications. However, recent large PFMs, like GPT-4 (openai2023gpt4), are inherently task-agnostic, accommodating various tasks owing to the diversity of demonstration templates encountered during their large-scale training stage.

1.3. Motivation and Benefit

MTL can be motivated from the following five perspectives with different benefits: cognitive/social psychology, data augmentation, learning efficacy, real-world scenarios, and learning theory.

• Psychologically, humans are inherently adaptable to new problems and settings, as the human learning process can transfer knowledge from one experience to another (national2000people). MTL is therefore inspired by simulating this process to give a model the potential for multitasking. A parallel example of this knowledge transfer occurs among organizations (argote2000knowledge): organizations with more effective knowledge transfer have been shown to be more productive and more likely to survive than those with less. These prior successes of transfers or mutualizations in other areas encourage the joint learning of tasks in ML (caruana1997multitask).

• In the pre-big data era, real-world problems were usually represented by small but high-dimensional datasets ( $\#$ samples $<\#$ features). This data bottleneck forced early methods to learn a sparse-structured model, which leads to a parsimonious solution to a problem with insufficient data. MTL then emerged as a way to aggregate labeled data from different domains or tasks, enlarging the training dataset to counter overfitting.

• The pursuit of efficiency and effectiveness is another motivation. MTL can aggregate data from different sources, and the joint training process of multiple tasks can save both computation and storage resources. In addition, the potential for performance enhancement makes it popular in research communities. In brief, universal representations for any task can be learned from multi-source data and benefit all tasks in terms of both learning cost and performance.

• Because the majority of real-world problems are naturally multimodal or multitasking, MTL is proposed to remedy the suboptimal results achieved by STL, which models only parts of the whole problem separately. For example, predicting the progression of Alzheimer’s Disease (AD) biomarkers for Mild Cognitive Impairment (MCI) risk and clinical diagnosis is simultaneously based on multimodal data such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) (jie2015manifold; kwak2018multi; chen2022machine). Autonomous driving, another example, also involves multiple subtasks to calculate the final prediction (yang2018end; chowdhuri2019multinet), including the recognition of surrounding objects, adjustments to the fastest route according to the traffic conditions, the balance between efficiency and safety, etc.

• From the perspective of learning theory, bias-free learning has been proved to be impossible (mitchell1980need), so we can motivate MTL by using the extra training signals from related tasks. Generally, MTL is one of the ways to achieve inductive transfer via multitasking assistance, which improves both learning speed and generalization. Specifically, during the combined training of multiple tasks, some tasks receive inductive bias from other related tasks, and these stronger inductive biases (compared with universal regularizers, e.g., $\ell_{2}$ ) enable knowledge transfer and yield better generalization on a fixed training dataset. In other words, task-related biases make a learner prefer hypotheses that can explain more than one task and prevent any specific task from overfitting.

1.4. Mechanism and Explanation

In this section, we explore three key mechanisms – regularization, inductive bias, and feature sharing – shedding light on how MTL operates to achieve enhanced performance across multiple tasks.

Regularization

In MTL, the total loss function is a combination of multiple loss terms with respect to each task. The related tasks play a role as regularizers, enhancing the generalizability across them. The hypothesis space of an MTL model is confined to a more limited scope as it tackles multiple tasks simultaneously. Consequently, this constraint on the hypothesis space reduces model complexity, mitigating the risk of overfitting.

Inductive Bias

The training signals from co-training tasks act as mutual inductive biases due to their shared domain information. These biases facilitate cross-task knowledge transfer during training, guiding the model to favor task-related concepts rather than the tasks themselves. Consequently, this broadens the model’s horizons beyond a singular task, enhancing its generalization capabilities for unseen out-of-distribution (OOD) data.

Feature Sharing

MTL can enable feature sharing across related tasks. One approach involves selecting overlapping features and maximizing their utility across all tasks. This is referred to as “eavesdropping” (ruder2017overview), considering that some features may be unavailable for specific tasks but can be substituted by those learned from related tasks. Another way is to concatenate all the features extracted by different tasks; these features can then be used holistically across tasks via linear combination or nonlinear transformation.
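The concatenation route described above can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the two feature extractors are random linear maps followed by `tanh`, the task heads are random weight vectors, and the names `A_task1`, `A_task2`, `head1`, and `head2` are hypothetical. It is not taken from any specific method in this survey.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 8, 10, 4           # samples, input dimension, features per extractor

X = rng.normal(size=(n, d))  # one batch of inputs seen by both tasks

# Two hypothetical task-specific feature extractors (random linear map + tanh).
A_task1 = rng.normal(size=(d, k))
A_task2 = rng.normal(size=(d, k))
feat1 = np.tanh(X @ A_task1)  # features extracted for task 1
feat2 = np.tanh(X @ A_task2)  # features extracted for task 2

# Concatenate the features of both tasks into one shared pool.
shared = np.concatenate([feat1, feat2], axis=1)  # shape (n, 2k)

# Each task head reads the whole pool through its own linear combination,
# so task 1 can also use features that only task 2 extracted.
head1 = rng.normal(size=(2 * k,))
head2 = rng.normal(size=(2 * k,))
pred1 = shared @ head1
pred2 = shared @ head2

print(shared.shape, pred1.shape, pred2.shape)
```

In a trained model the extractors and heads would be learned jointly; the sketch only shows how the shared pool makes every task's features available to every head.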

Overall, MTL can be an efficient and effective way to boost the performance of the ML model on multiple tasks by regularization, inductive transfer, and feature sharing.

1.5. Contributions and Highlights

Existing Surveys. ruder2017overview is a pioneering survey in MTL, offering a broad overview of MTL and focusing on advances in deep neural networks from 2015 to 2017. thung2018brief reviews MTL methods from a taxonomy perspective of input-output variants, mainly concentrating on traditional MTL prior to 2016. These two reviews complement each other. vafaeikia2020brief is an incomplete survey that briefly reviews recent deep MTL approaches, particularly focusing on the selection of auxiliary tasks for enhanced learning performance. crawshaw2020multi presents the well-established and advanced MTL methods before 2020 from the perspective of applications. vandenhende2021multi provides a comprehensive review of deep MTL in dense prediction tasks, which generate pixel-level predictions such as in semantic segmentation and monocular depth estimation. zhang2021survey first gives a comprehensive overview of MTL models from the taxonomy of feature-based and parameter-based approaches, but with limited inclusion of deep learning (DL) methods. Notably, all these surveys overlook the development of MTL in the last three or four years, known as the era of large PFMs (bommasani2021opportunities; zhou2023comprehensive), exemplified by the GPT-series models (radford2018improving; radford2019language; brown2020language; openai2023gpt4).

Roadmap. This survey adopts a well-organized structure, distinguishing it from its predecessors, to demonstrate the evolutionary journey of MTL from traditional methods to DL and the innovative paradigm shift introduced by PFMs, as shown in Fig. 1. In § 2.1, we provide a comprehensive summary of traditional MTL techniques, including feature selection, feature transformation, decomposition, low-rank factorization, priori sharing, and task clustering. Moving forward, § 2.2 is devoted to exploring the critical dimensions of deep MTL methodologies, encompassing feature fusion, cascading, knowledge distillation, cross-task attention, scalarization, multi-objective optimization (MOO), adversarial training, Mixture-of-Experts (MoE), graph-based methods, and NAS. The recent advancements in PFMs are introduced in § 2.3, categorized based on task-generalizable fine-tuning, task promptable engineering, as well as task-agnostic unification. Additionally, we provide a concise overview of the miscellaneous aspects of MTL in § 3. § 4 provides valuable resources and tools to enhance the engagement of researchers and practitioners with MTL. Our discussions and future directions are presented in § 5, followed by our conclusion in § 6. The goal of this review is threefold: 1) to provide a comprehensive understanding of MTL for newcomers; 2) to function as a toolbox or handbook for engineering practitioners; and 3) to inspire experts by providing insights into the future directions and potentials of MTL.

2. MTL Models

Formalization

In machine learning, no matter the problem (discriminative, generative, adversarial, etc.), we hope to learn a predictive model by minimizing the regularized empirical loss as

$$\min\limits_{\boldsymbol{W}}{\mathcal{L}}(f_{\boldsymbol{W}}(\boldsymbol{X}),\boldsymbol{Y})+\lambda\Omega(\boldsymbol{W}),\tag{1}$$

where $(\boldsymbol{X},\boldsymbol{Y})$ are data pairs sampled from a single task, and $\boldsymbol{W}$ contains the weights of the learning model $f(\cdot)$. In general, ${\mathcal{L}}$ measures the distance between the predictions and the ground truth, and $\Omega$ adds constraints to the learning model, e.g., sparsity. The trade-off parameter $\lambda$ controls the balance between the loss and the penalty. Fig. 4(a) shows the detailed framework of STL. In comparison, as shown in Fig. 4(b), the optimization in MTL is conducted on multiple loss functions to achieve joint learning, and each task can maintain a specific loss function. Accordingly, MTL considers the following problem:

$$\min\limits_{\{\boldsymbol{W}^{(t)}\}_{t=1}^{T}}\sum_{t=1}^{T}{\mathcal{L}}^{(t)}\left(f_{\boldsymbol{W}^{(t)}}(\boldsymbol{X}^{(t)}),\boldsymbol{Y}^{(t)}\right)+\lambda\Omega\left(\boldsymbol{W}^{(1)},\cdots,\boldsymbol{W}^{(T)}\right),\tag{2}$$

where $T$ denotes the number of tasks, and $f(\cdot)$ is the MTL model to be learned. In MTL, $f(\cdot)$ always encodes both task-specific and -shared representations, and $\Omega(\cdot)$ builds task relatedness and reciprocity; both contribute to the effectiveness and efficiency of MTL.
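To make the difference between Eq. (1) and Eq. (2) concrete, the following NumPy sketch evaluates both objectives for linear models. It is only an illustration under stated assumptions: the loss is the mean squared error, $\Omega$ is taken to be the $\ell_{1}$ norm in the single-task case and the $\ell_{2,1}$ norm in the multi-task case, and the function names (`squared_loss`, `l21_norm`, `stl_objective`, `mtl_objective`) are hypothetical. Other losses and penalties fit the same template.

```python
import numpy as np

def squared_loss(pred, target):
    """Mean squared error between predictions and ground truth."""
    return float(np.mean((pred - target) ** 2))

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (rows = features, columns = tasks)."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def stl_objective(w, X, y, lam):
    """Eq. (1) for a single task, with Omega chosen as the l1 norm (an assumption)."""
    return squared_loss(X @ w, y) + lam * float(np.sum(np.abs(w)))

def mtl_objective(W, Xs, ys, lam):
    """Eq. (2): sum of per-task losses plus one joint penalty over all task weights.

    W has shape (d, T); column t holds the weights W^(t) of task t.
    Xs[t] has shape (n_t, d) and ys[t] has shape (n_t,).
    """
    total_loss = sum(
        squared_loss(X @ W[:, t], y) for t, (X, y) in enumerate(zip(Xs, ys))
    )
    return total_loss + lam * l21_norm(W)

# Synthetic data: three tasks with their own samples but a common feature space.
rng = np.random.default_rng(0)
d, T = 5, 3
Xs = [rng.normal(size=(n, d)) for n in (20, 30, 25)]
W_true = rng.normal(size=(d, T))
ys = [X @ W_true[:, t] for t, X in enumerate(Xs)]

print(stl_objective(W_true[:, 0], Xs[0], ys[0], lam=0.1))
print(mtl_objective(W_true, Xs, ys, lam=0.1))
```

The single joint penalty is what couples the tasks: with $\lambda=0$ the multi-task objective splits into $T$ independent single-task problems, whereas a row-wise penalty such as $\ell_{2,1}$ encourages all tasks to keep or discard the same features.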

I/O Configurations

To accommodate data in Eq. (2), it is necessary to consider various input/output (I/O) configurations that may impose constraints on the MTL modeling process. For instance, tasks such as semantic segmentation and depth estimation can utilize the same input images, and the applications are typically developed using datasets where each image is annotated with dense prediction labels for both segmentation and depth. On the other hand, when dealing with a digit recognition problem involving multiple domains (e.g., handwritten digits and license plate digits), different inputs are mapped to the same output space. We refer to the former as a single-input multi-output (SIMO) configuration and the latter as a multi-input single-output (MISO) configuration. In MTL, the most prevalent scenarios fall under the multi-input multi-output (MIMO) configuration, where each task maintains its own set of samples and the label spaces are diverse, e.g., autonomous driving that involves pedestrian detection and traffic sign recognition. Let us denote the data input space and its corresponding label space for the $t$-th task $(t=1,\cdots,T)$ by $\mathcal{X}^{(t)}$ and $\mathcal{Y}^{(t)}$, respectively. We classify MTL problems into three cases: SIMO, MISO, and MIMO. Fig. 5 illustrates these three configurations. It is worth noting that the I/O configurations do not significantly impact the taxonomy of methods in MTL. As indicated in Table 2, there are numerous shared practices of applying different methods to these I/O configurations, as well as to various data modalities and task types.
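The three configurations can be told apart by checking whether the tasks share an input space $\mathcal{X}^{(t)}$ and whether they share a label space $\mathcal{Y}^{(t)}$. The sketch below encodes that rule in Python. The `TaskSpec` class, the `io_configuration` function, and the space names are hypothetical and exist only for illustration; spaces are compared by name, which is a simplification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one task: names of its input and label spaces."""
    name: str
    input_space: str
    label_space: str

def io_configuration(tasks):
    """Classify a task set as SIMO, MISO, or MIMO from its input/label spaces."""
    if len(tasks) < 2:
        raise ValueError("an MTL configuration needs at least two tasks")
    shared_input = len({t.input_space for t in tasks}) == 1
    shared_output = len({t.label_space for t in tasks}) == 1
    if shared_input and not shared_output:
        return "SIMO"
    if shared_output and not shared_input:
        return "MISO"
    if not shared_input and not shared_output:
        return "MIMO"
    raise ValueError("identical input and label spaces: not one of the three cases")

# The three examples from the text, with made-up space names.
dense = [
    TaskSpec("semantic segmentation", "street images", "per-pixel class"),
    TaskSpec("depth estimation", "street images", "per-pixel depth"),
]
digits = [
    TaskSpec("handwritten digits", "handwriting images", "digit 0-9"),
    TaskSpec("license plate digits", "plate images", "digit 0-9"),
]
driving = [
    TaskSpec("pedestrian detection", "pedestrian dataset", "bounding boxes"),
    TaskSpec("traffic sign recognition", "sign dataset", "sign class"),
]

for tasks in (dense, digits, driving):
    print(io_configuration(tasks))
```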

Table 2. Summary of MTL methods discussed in § 2.

I/O

Data Modality

Task Type

MTL Strategy

Assumption

SIMO

MISO

MIMO

Table

Image

Text

Graph

Regression

Classification

Dense Prediction

Feature Selection

✓

✗

Decomposition

✓

Regularization

Low-Rank Factorization

✓

Priori Sharing

✓

Task Clustering/Grouping

✓

Group-Based Learning

✓

✗

✓

Relationship Learning

Mixture-of-Experts

✓

Feature Fusion

✓

✗

✓

Cascading

✓

✗

✓

Knowledge Distillation

✓

✗

✓

Feature Propagation

Cross-Task Attention

✓

✗

✓

Scalarization

✓

Multi-Objective Optimization

✓

Adversarial Training

✓

Optimization

Neural Architecture Search

✓

Downstream Fine-tuning

✓

✗

✓

Task Prompting

✓

Pre-training

Multi-Modal Unification

✓

✓ indicates common practice in the research community. ✗ indicates not applicable due to technical constraints.

Taxonomy

MTL had seen significant advancement prior to the DL era (caruna1993multitask; caruana1997multitask; bakker2003task; ando2005framework; obozinski2006multi; zhang2006a). Initially, there was a strong focus on weight/parameter regularization, including sparse learning for cross-task feature selection, low-rank learning to uncover underlying factors, and decomposition methods to capture informative components. These approaches, while innovative in deriving intuitive variations of existing methods (e.g., the $\ell_{2,1}$ regularizer derived from the classic $\ell_{1}$ regularizer), still face limitations in practical applications due to their idealistic assumptions and a lack of consideration for task relationships. The emergence of methods like task clustering, priori sharing, graph-based learning, and MoE marked a shift towards more effective task relationship modeling. With the transition to the DL era, the abundance of features learned by architectures like convolutional neural networks (CNNs) (fukushima1980neocognitron; lecun1998gradient), recurrent neural networks (RNNs) (werbos1988generalization; hochreiter1997long) and Transformers (vaswani2017attention; dosovitskiy2020image) spurred the exploration of feature propagation methods, such as feature fusion, cascading, knowledge distillation (KD), and cross-task attention, all crucial for leveraging multi-source features. Alternatively, optimization-based methods, including scalarization, MOO, adversarial training, and NAS, focus on gradients to harmonize optimization directions across tasks. These methods, while not restricted by I/O configurations, are constrained by the pre-defined number of tasks and the use of heterogeneous architectures. Pre-training techniques, which leverage TL, mark a significant advancement towards unified and versatile multitasking, breaking limitations related to data modalities, dimensions, task numbers, model architectures, etc. The main cost is the substantial computational resources needed to train a model large enough to accommodate multi-task distributions. The MTL models are accordingly organized into five categories: regularization, relationship learning, feature propagation, optimization, and pre-training. Each contains a series of topics arranged chronologically in § 2.1 (traditional ML era), § 2.2 (DL era), and § 2.3 (PFM era). All of these topics can be derived from three assumptions, which are intuitive yet extensively validated by empirical evidence:

Assumption 1 (Parameter Relatedness).

Under the same hypothesis space, models learned to perform related tasks can exhibit similarities.

Assumption 2 (Feature Richness).

Given the same level of experience, expanding the number of tasks to be learned can enhance the richness of features.

Assumption 3 (Optimization Consistency).

Learning multiple related tasks jointly in a single model can ensure consistency in optimization directions for each task.

We acknowledge that the presented taxonomy is not exhaustive, and certain methods may be classified differently when viewed from a different perspective. For example, Task Tree (TAT) (han2015learning), a clustering MTL method, establishes a task hierarchy by decomposing the parameter matrix into component matrices, one for each tree layer; although this involves decomposition, we discuss it within the context of clustering MTL (see § 2.1.6). We also acknowledge that some methods of interest to readers may not be included in this survey due to similarity with covered methods or oversight. We welcome paper recommendations and will update the survey on our project page accordingly (https://github.com/junfish/Awesome-Multitask-Learning). In Table 2, we summarize the assumptions, common practices, and technical constraints of these topics in terms of I/O configuration, data modality, and task type.

2.1. Traditional Era: Provable but Restrictive

Table 3. Summary of basic notations used in this paper.

Notation	Description
$n,N\in\mathbb{R}$	Scalars are denoted by plain lowercase or uppercase letters.
#object	The number of objects, e.g., #task denotes the number of tasks.
$\boldsymbol{x}$ or $\vec{\boldsymbol{x}}\in\mathbb{R}^{N}$	A vector $\boldsymbol{x}$ with $N$ entries, denoted by bold lowercase letters.
$\boldsymbol{X}\in\mathbb{R}^{M\times N}$	A matrix $\boldsymbol{X}$ with size $M\times N$ , denoted by bold uppercase letters.
$\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$	A tensor $\boldsymbol{\mathcal{X}}$ with size $I_{1}\times\cdots\times I_{N}$ , denoted by bold calligraphic letters.
$\{\star^{(i)}\}_{i=1}^{N}$	A set containing $\star^{(1)},\cdots,\star^{(N)}$ , where $\star$ could be anything, e.g., scalar, vector, data pair, learner, etc.
$x_{n}\in\mathbb{R}$	The $n$ -th entry of vector $\boldsymbol{x}\in\mathbb{R}^{N},n\in\{1,2,\cdots,N\}$ .
$x_{m,n}$ or $[\boldsymbol{X}]_{m,n}\in\mathbb{R}$	The $(m,n)$ -th entry of matrix $\boldsymbol{X}\in\mathbb{R}^{M\times N},m\in\{1,2,\cdots,M\},n\in\{1,2,\cdots,N\}$ .
$\boldsymbol{X}\odot\boldsymbol{Y}\in\mathbb{R}^{M\times N}$	Element-wise product of $\boldsymbol{X}\in\mathbb{R}^{M\times N}$ and $\boldsymbol{Y}\in\mathbb{R}^{M\times N}$ , which means the $(m,n)$ -th entry of $\boldsymbol{X}\odot\boldsymbol{Y}$ is $x_{m,n}y_{m,n}$ .
$\boldsymbol{x}^{n}\in\mathbb{R}^{M}$	The $n$ -th column vector of matrix $\boldsymbol{X}\in\mathbb{R}^{M\times N},n\in\{1,2,\cdots,N\}$ .
$\boldsymbol{x}_{m}\in\mathbb{R}^{N}$	The $m$ -th row vector of matrix $\boldsymbol{X}\in\mathbb{R}^{M\times N},m\in\{1,2,\cdots,M\}$ .
$\boldsymbol{I}_{N\times N}\in\mathbb{R}^{N\times N}$	The identity matrix of size $N\times N$ , which has ones on the diagonal and zeros elsewhere.
tr $(\boldsymbol{X})\in\mathbb{R}$	The trace of a matrix $\boldsymbol{X}\in\mathbb{R}^{N\times N}$ , defined as the sum of its $N$ components on the diagonal.
col $(\boldsymbol{X})\subseteq\mathbb{R}^{M}$	The column space of a matrix $\boldsymbol{X}\in\mathbb{R}^{M\times N}$ , which consists of all linear combinations of its column vectors.
rank $(\boldsymbol{X})\in\mathbb{R}$	The rank of matrix $\boldsymbol{X}$ , defined as the maximum number of linearly independent column (or row) vectors of $\boldsymbol{X}$ .
vec $(\boldsymbol{X})\in\mathbb{R}^{MN}$	The vectorization of the matrix $\boldsymbol{X}\in\mathbb{R}^{M\times N}$ in the row-by-row stacking way.
$\boldsymbol{D}^{+}\in\mathbb{R}^{N\times M}$	The pseudoinverse of a matrix $\boldsymbol{D}\in\mathbb{R}^{M\times N}$ .
$\boldsymbol{O}^{N}\subset\mathbb{R}^{N\times N}$	The set of $N\times N$ orthogonal matrices.
$\boldsymbol{X}\in\boldsymbol{O}^{N}$	The column vectors $\boldsymbol{x}^{1},\cdots,\boldsymbol{x}^{N}$ of matrix $\boldsymbol{X}$ are orthonormal.
$\boldsymbol{S}^{N}\subset\mathbb{R}^{N\times N}$	The set of $N\times N$ real symmetric matrices.
$\boldsymbol{S}_{+}^{N}\subset\boldsymbol{S}^{N}$	The subset of $\boldsymbol{S}^{N}$ that contains positive semidefinite matrices.
$\|\boldsymbol{w}\|_{1}$	The $\ell_{1}$ norm of a vector, calculated as the sum of the absolute vector values.
$\|\boldsymbol{w}\|_{2}$	The $\ell_{2}$ norm of a vector, calculated as the square root of the sum of the squared vector values.
$\|\boldsymbol{w}\|_{\infty}$	The $\ell_{\infty}$ norm of a vector, calculated as the maximum of the absolute vector values.
$\|\boldsymbol{W}\|_{0}$	The $\ell_{0}$ norm, i.e., cardinality of a matrix, defined as the number of nonzero components.
$\|\boldsymbol{W}\|_{1}$	The $\ell_{1}$ norm of a matrix, calculated as the maximum of the $\ell_{1}$ norm of the column vectors.
$\|\boldsymbol{W}\|_{2}$	The $\ell_{2}$ norm of a matrix, calculated as its maximum singular value.
$\|\boldsymbol{W}\|_{F}$	The Frobenius norm of a matrix, calculated as the square root of the sum of the squared matrix values.
$\{\sigma_{r}(\boldsymbol{W})\}_{r=1}^{R}$	The set of non-increasing ordered singular values of matrix $\boldsymbol{W}$ .
$\|\boldsymbol{W}\|_{*}$	The trace norm of a matrix, defined as the sum of its singular values, i.e., $\sum_{r=1}^{R}\sigma_{r}(\boldsymbol{W})$ .
$\|\boldsymbol{W}\|_{\infty}$	The $\ell_{\infty}$ norm of a matrix, calculated as the maximum of the $\ell_{1}$ norm of the row vectors.
$\|\boldsymbol{W}\|_{p,q}$	The $\ell_{p,q}$ norm of a matrix, defined as the $q$ -norm of the vector whose components are the $p$ -norms of $\boldsymbol{W}$ ’s row vectors.
$\|\boldsymbol{W}\|_{1,1}$	The $\ell_{1,1}$ norm of a matrix, defined as the sum of the absolute matrix components.
$\|\boldsymbol{W}\|_{1,2}$	The $\ell_{1,2}$ norm of a matrix, calculated as the $\ell_{2}$ norm of the vector whose components are the $\ell_{1}$ norms of the row vectors.
$\|\boldsymbol{W}\|_{2,1}$	The $\ell_{2,1}$ norm of a matrix, calculated as the sum of the $\ell_{2}$ norms of the row vectors.
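The matrix norms in Table 3 can be checked numerically. The following is a minimal NumPy sketch, not part of the surveyed methods; the helper names `lpq_norm` and `trace_norm` and the small example matrix are illustrative assumptions.

```python
import numpy as np

def lpq_norm(W, p, q):
    """l_{p,q} norm: the q-norm of the vector of p-norms of W's rows."""
    row_norms = np.linalg.norm(W, ord=p, axis=1)
    return np.linalg.norm(row_norms, ord=q)

def trace_norm(W):
    """Trace norm: the sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

# Hypothetical 3 x 2 weight matrix (3 features, 2 tasks); the second row is all zeros.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

l21 = lpq_norm(W, 2, 1)          # sum of row-wise l2 norms
l11 = lpq_norm(W, 1, 1)          # sum of absolute entries
linf1 = lpq_norm(W, np.inf, 1)   # sum of row-wise maxima
l22 = lpq_norm(W, 2, 2)          # coincides with the Frobenius norm
fro = np.linalg.norm(W, "fro")
nuc = trace_norm(W)
```

Rows correspond to features, so a row-wise norm such as $\ell_{2,1}$ penalizes a feature jointly across all tasks, which is why it encourages entire rows of $\boldsymbol{W}$ to vanish.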

To establish a unified formulation, we start the review of traditional methods by defining a common framework. The notations for subsequent discussions are summarized in Table 3. Building upon this, we begin our discussion with a standard regression model for each task as a paradigm. The weights of these homogeneous models can be arranged into one weight matrix, catalyzing a series of MTL studies through matrix regularization techniques in the traditional era. We denote by $\{(\boldsymbol{X}^{(t)},\boldsymbol{y}^{(t)})\}_{t=1}^{T}$ our dataset across $T$ tasks. For each task indexed by $t\in\{1,2,\cdots,T\}$ , we are given $N_{t}$ samples with $D$ features, i.e., $\boldsymbol{X}^{(t)}\in\mathbb{R}^{N_{t}\times D}$ , and the corresponding response values $\boldsymbol{y}^{(t)}\in\mathbb{R}^{N_{t}}$ .

The single-task setting of these multiple linear regression problems is

(3)

\boldsymbol{y}^{(t)}={\boldsymbol{X}^{(t)}}\boldsymbol{w}^{(t)}+\epsilon^{(t)},\quad t=1,\cdots,T,

where $\boldsymbol{w}^{(t)}\in\mathbb{R}^{D}$ for any $t\in\{1,\cdots,T\}$ , $\epsilon^{(t)}\sim\mathcal{N}(0,\sigma_{t}^{2}\boldsymbol{I})$ is the error term independent of $\boldsymbol{X}^{(t)}$ , and $\sigma_{t}$ is determined by the system state for the $t$ -th task. Each model is separately learned from independent samples $\{({\boldsymbol{x}_{1}^{(t)}}^{\top},y_{1}^{(t)}),\cdots,({\boldsymbol{x}_{N_{t}}^{(t)}}^{\top},y_{N_{t}}^{(t)})\}$ .

A trivial simplification of the above linear regressions is that all tasks maintain the same feature size $D$ , thus leading to a natural idea of stacking the weight vectors of these tasks: $\boldsymbol{W}=[\boldsymbol{w}^{(1)},\cdots,\boldsymbol{w}^{(T)}]\in\mathbb{R}^{D\times T}$ , where the matrix-based regularizers come into play. To estimate $\boldsymbol{W}$ , the MTL method minimizes the objective function:

(4)

\min\limits_{\boldsymbol{W}}\sum\limits_{t=1}^{T}\frac{1}{N_{t}}\mathcal{L}^{(t)}\left({\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t},\boldsymbol{y}^{(t)}\right)+\lambda\Omega(\boldsymbol{W}),

where we consider the weights of multiple models, i.e., $\boldsymbol{W}$ , as a whole, and denote by $\boldsymbol{w}^{t}$ the $t$ -th column of $\boldsymbol{W}$ . Normally, an identical loss function, e.g., mean squared error (MSE), is applied to all $\{{\mathcal{L}}^{(t)}\}_{t=1}^{T}$ , which originates from the i.i.d. assumption on $\{\epsilon^{(t)}\}_{t=1}^{T}$ . To capture task relatedness under Assumption 1, i.e., that the models are similar to each other, $\Omega$ takes various regularization forms in traditional MTL. An overview of the regularization techniques used for MTL in the traditional ML era, discussed in the following subsections, is presented in Table 2.1.
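To make Eq. (4) concrete, the sketch below evaluates the objective with squared-error losses and, as one possible choice of $\Omega$, the $\ell_{2,1}$ norm. The function names, the synthetic data, and the value of `lam` are illustrative assumptions rather than settings taken from any surveyed paper.

```python
import numpy as np

def l21_norm(W):
    """Sum of the l2 norms of the rows of W (one row per feature)."""
    return np.linalg.norm(W, axis=1).sum()

def mtl_objective(W, Xs, ys, lam, omega=l21_norm):
    """Eq. (4) with squared-error losses, using the 1/2 * 1/N_t scaling."""
    data_fit = 0.0
    for t, (X, y) in enumerate(zip(Xs, ys)):
        residual = X @ W[:, t] - y          # w^t is the t-th column of W
        data_fit += 0.5 * (residual @ residual) / X.shape[0]
    return data_fit + lam * omega(W)

# Hypothetical noise-free data: T = 3 tasks, D = 5 features, different N_t per task.
rng = np.random.default_rng(0)
D, T = 5, 3
N = [20, 30, 40]
W_true = np.zeros((D, T))
W_true[:2] = rng.normal(size=(2, T))        # only two features are relevant
Xs = [rng.normal(size=(n, D)) for n in N]
ys = [X @ W_true[:, t] for t, X in enumerate(Xs)]

lam = 0.1
obj_true = mtl_objective(W_true, Xs, ys, lam)            # data-fit term vanishes
obj_zero = mtl_objective(np.zeros((D, T)), Xs, ys, lam)  # penalty term vanishes
```

Swapping `omega` for another matrix regularizer (e.g., a trace-norm function) changes the coupling between tasks without touching the data-fit term, which is the design pattern shared by the methods summarized next.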

Model Name	Origin	Year	Type	Matrix Regularizer	Vector Formalization
Regularized MTL	KDD	evgeniou2004regularized	Group regularization	Frobenius norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{t=1}^{T}{\|\boldsymbol{w}^{t}-\frac{1}{T}\sum_{s=1}^{T}\boldsymbol{w}^{s}\|}^{2}_{2}+\lambda_{2}\sum_{t=1}^{T}{\|\boldsymbol{w}^{t}\|}^{2}_{2}$
Learning Multiple Tasks with Kernel Methods	JMLR	evgeniou2005learning	Priori sharing	Adaptive penalty	$\min\limits_{\boldsymbol{V},\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{t=1}^{T}{\boldsymbol{w}^{t}}^{\top}\boldsymbol{V}^{+}\boldsymbol{w}^{t},$ s.t. $\boldsymbol{V}\in\boldsymbol{S}_{+}^{D},$ $\boldsymbol{V}\in\boldsymbol{S}^{D}$
Alternating structure optimization	JMLR	ando2005framework	Decomposition	Frobenius norm	$\min\limits_{\{\boldsymbol{W},\boldsymbol{V}\},\Theta}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}(\boldsymbol{w}^{t}+\Theta^{\top}\boldsymbol{v}^{t})-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}$ , s.t. $\Theta\Theta^{\top}=\boldsymbol{I}_{h\times h}$
Multi-task feature selection	Tech. Rep.¹	obozinski2006multi	Group-sparse learning	$\ell_{2,1}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{d=1}^{D}{\|\boldsymbol{w}_{d}\|}_{2}$
Multi-task Lasso	Thesis²	zhang2006a	Group-sparse learning	$\ell_{\infty,1}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{d=1}^{D}{\|\boldsymbol{w}_{d}\|}_{\infty}$
Multi-task feature learning	NeurIPS	argyriou2006multi	Group-sparse learning, feature learning	$\ell_{2,1}$ norm	$\min\limits_{\boldsymbol{U},\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|({\boldsymbol{X}^{(t)}}\boldsymbol{U})\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda(\sum_{d=1}^{D}{\|\boldsymbol{w}_{d}\|}_{2})^{2}$ , s.t. $\boldsymbol{U}\in\boldsymbol{O}^{D}$
Convex multi-task feature learning	Mach. Lea.	argyriou2008convex	Feature learning	Adaptive penalty	$\min\limits_{\boldsymbol{V},\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{t=1}^{T}{\boldsymbol{w}^{t}}^{\top}\boldsymbol{V}^{+}\boldsymbol{w}^{t},$ s.t. $\boldsymbol{V}\in\boldsymbol{S}_{+}^{D},$ tr $(\boldsymbol{V})\leq 1$ , col $(\boldsymbol{W})\subseteq$ col $(\boldsymbol{V})$
Low rank MTL	ICML	ji2009accelerated	Low-rank learning	Trace norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\|\boldsymbol{W}\|_{*}$
Convex ASO	ICML	chen2009convex	—	—	$\min\limits_{\boldsymbol{U},\Theta}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{u}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\eta(1-\eta)\text{tr}(\boldsymbol{U}^{\top}(\eta\boldsymbol{I}+\Theta^{\top}\Theta)^{-1}\boldsymbol{U})$ , s.t. $\Theta\Theta^{\top}=\boldsymbol{I}_{h\times h}$
Dirty block-sparse model	NeurIPS	jalali2010dirty	Group-sparse learning, decomposition	$\ell_{\infty,1}$ norm $+$ $\ell_{1,1}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}(\boldsymbol{s}^{t}+\boldsymbol{b}^{t})-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}{\|\boldsymbol{s}_{d}\|}_{1}+\lambda_{2}\sum_{d=1}^{D}{\|\boldsymbol{b}_{d}\|}_{\infty}$ , s.t. $\boldsymbol{W}=\boldsymbol{S}+\boldsymbol{B}$
Sparse multi-task Lasso	NeurIPS	lee2010adaptive	Group-sparse learning	Weighted $\ell_{2,1}$ norm $+$ weighted $\ell_{1,1}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\rho_{d}{\|\boldsymbol{w}_{d}\|}_{2}+\lambda_{2}\sum_{d=1}^{D}\theta_{d}{\|\boldsymbol{w}_{d}\|}_{1}$
Adaptive multi-task Lasso	NeurIPS	lee2010adaptive	Group-sparse learning	Weighted $\ell_{2,1}$ norm $+$ weighted $\ell_{1,1}$ norm $+$ adaptive penalty	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\rho_{d}{\|\boldsymbol{w}_{d}\|}_{2}+\lambda_{2}\sum_{d=1}^{D}\theta_{d}{\|\boldsymbol{w}_{d}\|}_{1}+\log Z(\boldsymbol{\rho},\boldsymbol{\theta})$ , where $P(\boldsymbol{W}|\boldsymbol{\rho},\boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\rho},\boldsymbol{\theta})}\prod_{d=1}^{D}\prod_{t=1}^{T}\exp(-\theta_{d}\lvert w_{d,t}\rvert)\times\prod_{d=1}^{D}\exp(-\rho_{d}\|\boldsymbol{w}_{d}\|_{2})$
Large margin multi-task metric learning	NeurIPS	parameswaran2010large	Priori sharing	Frobenius norm	$\min\limits_{\mathbf{M}_{0},\ldots,\mathbf{M}_{T}}\gamma_{0}\|\mathbf{M}_{0}-\mathbf{I}\|_{F}^{2}+\sum\nolimits_{t=1}^{T}\left[\gamma_{t}\|\mathbf{M}_{t}\|_{F}^{2}+\sum\nolimits_{(i,j)\in J_{t},j\neq i}d_{t}^{2}(\mathbf{x}_{i},\mathbf{x}_{j})+\sum\nolimits_{(i,j,k)\in S_{t}}\xi_{ijk}\right]$ , s.t. $\forall t,\forall(i,j,k)\in S_{t}\colon\quad d_{t}^{2}(\mathbf{x}_{i},\mathbf{x}_{k})-d_{t}^{2}(\mathbf{x}_{i},\mathbf{x}_{j})\geq 1-\xi_{ijk};\xi_{ijk}\geq 0;\mathbf{M}_{0},\mathbf{M}_{1},\ldots,\mathbf{M}_{T}\geq 0$
Hierarchical multitask structured output learning	NeurIPS	gornitz2011hierarchical	Priori sharing	Frobenius norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\frac{1}{2}\sum_{t=1}^{T}\|\boldsymbol{w}\|_{2}^{2}-\lambda\boldsymbol{w}^{T}\boldsymbol{w}_{p}$ , where $p$ is the parent node.
Robust MTL	KDD	chen2011integrating	Decomposition, group-sparse learning, low-rank learning	Trace norm + $\ell_{2,1}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\|{\boldsymbol{X}^{(t)}}(\boldsymbol{l}^{t}+\boldsymbol{s}^{t})-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}\|\boldsymbol{L}\|_{*}+\lambda_{2}\sum_{t=1}^{T}\|\boldsymbol{s}_{t}\|_{2}$ , s.t. $\boldsymbol{W}=\boldsymbol{L}+\boldsymbol{S}$
Temporal group Lasso	KDD	zhou2011multi	Group-sparse learning	Frobenius norm + $\ell_{2,1}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}+\lambda_{2}\sum_{t=1}^{T-1}\|\boldsymbol{w}^{t}-\boldsymbol{w}^{t+1}\|_{2}^{2}+\lambda_{3}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}$
Clustered MTL	NeurIPS	zhou2011clustered	Task clustering	Clustering penalty + $\ell_{2,2}$ norm	$\min\limits_{\boldsymbol{W},\boldsymbol{F}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}(\text{tr}(\boldsymbol{W}^{\top}\boldsymbol{W})-\text{tr}(\boldsymbol{F}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{F}))+\lambda_{2}\sum_{t=1}^{T}{\|\boldsymbol{w}^{t}\|}^{2}_{2}$ , s.t. $\boldsymbol{F}_{t,j}=1/\sqrt{n_{j}}~\text{if}~t\in\mathcal{C}_{j}~\text{otherwise}~0,$ $t=1,\cdots,T$ , where $n_{j}$ is the #task in the $j$ -th cluster $\mathcal{C}_{j}$ .
Sparse and low rank MTL	TKDD	chen2012learning	Decomposition, sparse learning, low-rank learning	$\ell_{1,1}$ norm + trace norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{p}_{d}\|_{1}$ , s.t. $\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q},\|\boldsymbol{Q}\|_{*}\leq\tau$
Convex fused sparse group Lasso	KDD	zhou2012modeling	Group-sparse learning	$\ell_{1,1}$ norm $+$ $\ell_{2,1}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{1}+\lambda_{2}\sum_{t=1}^{T-1}\|\boldsymbol{w}^{t}-\boldsymbol{w}^{t+1}\|_{1}+\lambda_{3}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}$
Adaptive multi-task elastic-net	SDM	chen2012adaptive	Group-sparse learning	$\ell_{2,1}$ norm $+$ Frobenius norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}{\|\boldsymbol{w}_{d}\|}_{2}+\lambda_{2}\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2}$
Multi-level Lasso	ICML	lozano2012multi	Decomposition, sparse learning	$\ell_{1,1}$ norm + adaptive penalty	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\theta_{d}+\lambda_{2}\sum_{d=1}^{D}\|\boldsymbol{\gamma}_{d}\|_{1}$ , s.t. $\boldsymbol{W}=\vec{\boldsymbol{\theta}}\boldsymbol{\Lambda}\boldsymbol{\Gamma},\vec{\boldsymbol{\theta}}\geq\boldsymbol{0}$
Robust multi-task feature learning	KDD	gong2012robust	Decomposition, group-sparse learning	$\ell_{2,1}$ norm + $\ell_{1,2}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda_{1}\sum_{d=1}^{D}\|\boldsymbol{p}_{d}\|_{2}+\lambda_{2}\sqrt{\sum_{d=1}^{D}\|\boldsymbol{q}_{d}\|_{1}^{2}}$ , s.t. $\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q}$
Multi-stage multi-task feature learning	NeurIPS	gong2012multi	Sparse learning	Capped $\ell_{1}$ norm (zhang2010analysis)	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{d=1}^{D}\min\{\|\boldsymbol{w}_{d}\|_{1},\tau\}$
Convex formulation for MTL	IJCAI	zhang2012convex	Priori sharing	Clustering penalty	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\frac{\lambda_{1}}{2}$ tr $(\boldsymbol{W}\boldsymbol{W}^{T})+\frac{\lambda_{2}}{2}$ tr $(\boldsymbol{W}\boldsymbol{\Omega}^{-1}\boldsymbol{W}^{T})$ s.t. $\boldsymbol{\Omega}\in\boldsymbol{S}_{+}^{D}$ , tr $(\boldsymbol{\Omega})=1$
Multi-linear multi-task learning	ICML	romera2013multilinear	Low-rank learning	Overlapped tensor trace norm	$\min\limits_{\boldsymbol{\mathcal{W}}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{k=1}^{N}\|\boldsymbol{W}_{(k)}\|_{*}$ where $\boldsymbol{W}_{(k)}$ is the mode- $k$ unfolding of tensor $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times I_{2}\times\cdots\times I_{N}}$ .
Regularization approach to learn MTL	TKDD	zhang2014regularization	Priori sharing	Clustering penalty + $\ell_{2,2}$ norm	$\min\limits_{\boldsymbol{V},\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\frac{\lambda}{2}\sum_{t=1}^{T}\|\boldsymbol{w}^{t}\|_{2}^{2}+$ tr $(\boldsymbol{W}\boldsymbol{\Omega}^{-1}\boldsymbol{W}^{T})+d$ ln $\boldsymbol{\Omega}$ s.t. $\boldsymbol{\Omega}\in\boldsymbol{S}_{+}^{D}$
Multi-linear multi-task learning	NeurIPS	wimalawarne2014multitask	Low-rank learning	Scaled latent tensor trace norm	$\min\limits_{\boldsymbol{\mathcal{W}}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\inf_{\boldsymbol{\mathcal{W}}^{(1)}+\cdots+\boldsymbol{\mathcal{W}}^{(N)}=\boldsymbol{\mathcal{W}}}\lambda\sum_{k=1}^{N}I_{k}^{-1/2}\|\boldsymbol{W}_{(k)}^{(k)}\|_{*}$ where $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times I_{2}\times\cdots\times I_{N}}$ is a tensor.
Task Tree model	KDD	han2015learning	Task clustering	$\ell_{2,2}$ norm	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\sum_{h=1}^{H}\boldsymbol{w}_{h}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\sum_{h=1}^{H}\lambda_{h}\sum_{i<j}^{T}\|\boldsymbol{w}_{h}^{i}-\boldsymbol{w}_{h}^{j}\|^{2}_{2}$ , s.t. $|\boldsymbol{w}_{h-1}^{i}-\boldsymbol{w}_{h-1}^{j}|\succeq|\boldsymbol{w}_{h}^{i}-\boldsymbol{w}_{h}^{j}|,\forall h\geq 2,\forall i<j$
Reduced rank multi-stage MTL	AAAI	han2016multi	Low-rank learning	Capped trace norm (sun2013robust)	$\min\limits_{\boldsymbol{W}}\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum_{r=1}^{R}\min\{\sigma_{r}(\boldsymbol{W}),\tau\}$
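As an illustration of how one row of the table above can be optimized, the following sketch solves the $\ell_{2,1}$-regularized problem (the multi-task feature selection row) by proximal gradient descent. The solver choice is an assumption made for illustration, since the table does not prescribe an optimizer, and the function names, synthetic data, and hyperparameters are hypothetical.

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: shrink each row's l2 norm by tau."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * W

def fit_l21(Xs, ys, lam, n_iter=500):
    """Proximal gradient for 1/2 * sum_t 1/N_t ||X_t w_t - y_t||^2 + lam * ||W||_{2,1}."""
    D, T = Xs[0].shape[1], len(Xs)
    W = np.zeros((D, T))
    # Lipschitz constant of the smooth part: largest per-task curvature.
    L = max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in Xs)
    for _ in range(n_iter):
        grad = np.column_stack([
            X.T @ (X @ W[:, t] - y) / X.shape[0]
            for t, (X, y) in enumerate(zip(Xs, ys))
        ])
        W = prox_l21(W - grad / L, lam / L)
    return W

# Hypothetical data: only the first two of six features matter, for all three tasks.
rng = np.random.default_rng(0)
D, T, N = 6, 3, 400
W_true = np.zeros((D, T))
W_true[:2] = np.array([[1.0, -2.0, 1.5],
                       [2.0, 1.0, -1.0]])
Xs = [rng.normal(size=(N, D)) for _ in range(T)]
ys = [X @ W_true[:, t] + 0.01 * rng.normal(size=N) for t, X in enumerate(Xs)]

W_hat = fit_l21(Xs, ys, lam=0.1)
row_norms = np.linalg.norm(W_hat, axis=1)   # one norm per feature
```

Because the proximal step shrinks whole rows, a feature is either kept for all tasks or discarded for all tasks, which is the cross-task feature selection behavior the $\ell_{2,1}$ regularizer is designed for.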

(15)

\begin{aligned}\min\limits_{\boldsymbol{V},\boldsymbol{W}}\quad&\frac{1}{2}\sum\limits_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|^{2}_{2}+\lambda\sum\limits_{t=1}^{T}{\boldsymbol{w}^{t}}^{\top}\boldsymbol{V}^{+}\boldsymbol{w}^{t},\\ \text{s.t.}\quad&\boldsymbol{V}\in\boldsymbol{S}_{+}^{D},~\text{tr}(\boldsymbol{V})\leq 1,~\text{col}(\boldsymbol{W})\subseteq\text{col}(\boldsymbol{V}).\end{aligned}
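A useful way to read Eq. (15) is through its penalty term alone. As a hedged numerical sketch, we take as an assumption the standard result associated with argyriou2008convex that, for fixed $\boldsymbol{W}$, the choice $\boldsymbol{V}^{\star}=(\boldsymbol{W}\boldsymbol{W}^{\top})^{1/2}/\text{tr}((\boldsymbol{W}\boldsymbol{W}^{\top})^{1/2})$ minimizes the penalty and yields the squared trace norm $\|\boldsymbol{W}\|_{*}^{2}$. The function names and the random matrix are illustrative.

```python
import numpy as np

def optimal_V(W):
    """(W W^T)^{1/2} / tr((W W^T)^{1/2}), computed from the thin SVD of W."""
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return (U * s) @ U.T / s.sum()

def penalty(W, V):
    """sum_t w_t^T V^+ w_t, written as tr(W^T V^+ W)."""
    return np.trace(W.T @ np.linalg.pinv(V, rcond=1e-10) @ W)

rng = np.random.default_rng(0)
D, T = 5, 3
W = rng.normal(size=(D, T))

V_star = optimal_V(W)
pen_star = penalty(W, V_star)
trace_norm_sq = np.linalg.svd(W, compute_uv=False).sum() ** 2

# Another feasible V (trace one, full column space) for comparison.
pen_uniform = penalty(W, np.eye(D) / D)
```

Under this assumption, `pen_star` matches `trace_norm_sq` and is no larger than `pen_uniform`, which connects the adaptive penalty of Eq. (15) to the trace-norm regularizer used by low-rank MTL.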

(23)

\begin{aligned}\min_{\boldsymbol{W}}\quad&\sum\limits_{t=1}^{T}\mathcal{L}^{(t)}\left(f(\boldsymbol{X}^{(t)},\boldsymbol{w}^{t}),\boldsymbol{y}^{(t)}\right)+\lambda_{1}~\text{reg}_{1}(\boldsymbol{P})+\lambda_{2}~\text{reg}_{2}(\boldsymbol{Q}),\\ \text{s.t.}\quad&\boldsymbol{W}=\boldsymbol{P}+\boldsymbol{Q}~\text{or}~\boldsymbol{W}=\boldsymbol{P}\cdot\boldsymbol{Q},\end{aligned}

(28)

\begin{aligned}\min\limits_{\{\boldsymbol{W},\boldsymbol{V}\},\Theta}\quad&\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{u}^{(t)}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda\sum_{d=1}^{D}\|\boldsymbol{w}_{d}\|_{2}^{2},\\ \text{s.t.}\quad&\Theta\Theta^{\top}=\boldsymbol{I}\end{aligned}

(29)

\begin{aligned}&\sum_{t=1}^{T}\frac{1}{N_{t}}\max\{\boldsymbol{0},1-(\boldsymbol{X}^{(t)}\boldsymbol{u}^{(t)})\cdot\boldsymbol{y}^{(t)}\}+\lambda_{1}\|\boldsymbol{u}^{(t)}-\Theta^{\top}\boldsymbol{v}^{(t)}\|^{2}+\lambda_{2}\|\boldsymbol{u}^{(t)}\|^{2},\\ &\text{s.t.}\quad\Theta\Theta^{\top}=\boldsymbol{I},\end{aligned}

(10)

\begin{aligned}&\rho_{d}=\sum\limits_{i}\varv_{i}f_{i}^{d}~\text{and}~\theta_{d}=\sum\limits_{i}\omega_{i}f_{i}^{d},\quad d=1,\cdots,D,\\ &\text{s.t.}\quad\sum\limits_{i}\varv_{i}=\sum\limits_{i}\omega_{i}=1,\end{aligned}

(31)

\begin{aligned}&\sum_{t=1}^{T}\frac{1}{N_{t}}\max\{\boldsymbol{0},1-(\boldsymbol{X}^{(t)}\boldsymbol{u}^{(t)})\cdot\boldsymbol{y}^{(t)}\}+\boldsymbol{G}(\boldsymbol{U},\Theta),\\ &\text{s.t.}\quad\Theta\Theta^{\top}=\boldsymbol{I}.\end{aligned}

(34)

\begin{aligned}&\min_{\boldsymbol{w}_{0},\boldsymbol{v}_{t},\xi_{it}}\big\{\sum_{t=1}^{T}\sum_{i=1}^{m}\xi_{it}+\frac{\lambda_{1}}{T}\sum_{t=1}^{T}\|\boldsymbol{v}_{t}\|_{2}^{2}+\lambda_{2}\|\boldsymbol{w}_{0}\|_{2}^{2}\big\},\\ &\text{s.t.}\quad y_{it}(\boldsymbol{w}_{0}+\boldsymbol{v}_{t})\cdot\boldsymbol{x}_{it}\geq 1-\xi_{it},~\xi_{it}\geq 0,~\forall i\in\{1,2,\dots,m\}~\text{and}~t\in\{1,2,\dots,T\}\end{aligned}

(35)

\begin{aligned}&\langle f_{l}(\boldsymbol{x})f_{k}(\boldsymbol{x}^{\top})\rangle=K_{lk}^{f}k^{x}\langle\boldsymbol{x},\boldsymbol{x}^{\top}\rangle,\quad y_{il}\sim\mathcal{N}(f_{l}(x_{i}),\sigma_{l}^{2}),\quad l,k\in\{1,\dots,T\},~i\in\{1,\dots,N\}\\ &\min_{\boldsymbol{\theta}_{x}}\bigg(N\log|\langle F^{T}(\boldsymbol{K}^{x}(\boldsymbol{\theta}_{x}))^{-1}F\rangle|+T\log|\boldsymbol{K}^{x}(\boldsymbol{\theta}_{x})|\bigg),\end{aligned}

(36)

\begin{aligned}&\min_{\boldsymbol{W},\boldsymbol{\Omega}}{\mathcal{L}}(\boldsymbol{W})+\lambda_{1}\|\boldsymbol{W}\|_{F}^{2}+\lambda_{2}\text{tr}(\boldsymbol{W}\boldsymbol{\Omega}^{-1}\boldsymbol{W}^{T})\\ &\text{s.t.}\quad\boldsymbol{\Omega}\succ 0,~\text{tr}(\boldsymbol{\Omega})\leq 1,\end{aligned}

(38)

\begin{aligned}&\min_{\boldsymbol{w}_{t},\xi_{it}}\big\{\sum_{t=1}^{T}\sum_{i=1}^{m}\xi_{it}+\frac{\lambda_{1}\lambda_{2}}{T(\lambda_{1}+\lambda_{2})}\sum_{t=1}^{T}\|\boldsymbol{w}_{t}\|^{2}+\frac{\lambda_{1}^{2}}{T(\lambda_{1}+\lambda_{2})}\sum_{t=1}^{T}\|\boldsymbol{w}_{t}-\frac{1}{T}\sum_{s=1}^{T}\boldsymbol{w}_{s}\|^{2}\big\},\\ &\text{s.t.}\quad y_{it}\cdot\boldsymbol{w}_{t}\cdot\boldsymbol{x}_{it}\geq 1-\xi_{it},~\xi_{it}\geq 0,\end{aligned}

(39)

\begin{aligned}\min\limits_{\boldsymbol{W},\boldsymbol{F}}\quad&\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\boldsymbol{w}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\lambda_{1}(\text{tr}(\boldsymbol{W}^{\top}\boldsymbol{W})-\text{tr}(\boldsymbol{F}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{F}))+\lambda_{2}\sum_{t=1}^{T}{\|\boldsymbol{w}^{t}\|}^{2}_{2},\\ \text{s.t.}\quad&\boldsymbol{F}_{t,j}=1/\sqrt{n_{j}}~\text{if}~t\in\mathcal{C}_{j}~\text{otherwise}~0,\quad t=1,\cdots,T,\end{aligned}
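The clustering penalty in Eq. (39) can be read as a k-means-style within-cluster scatter of the task weight vectors. The following sketch checks this numerically for a fixed cluster assignment; the assignment, matrix sizes, and function names are illustrative assumptions.

```python
import numpy as np

def cluster_indicator(labels, k):
    """F in Eq. (39): F[t, j] = 1 / sqrt(n_j) if task t is in cluster j, else 0."""
    F = np.zeros((len(labels), k))
    for j in range(k):
        members = np.flatnonzero(labels == j)
        F[members, j] = 1.0 / np.sqrt(len(members))
    return F

def clustering_penalty(W, F):
    """tr(W^T W) - tr(F^T W^T W F)."""
    G = W.T @ W
    return np.trace(G) - np.trace(F.T @ G @ F)

def within_cluster_scatter(W, labels, k):
    """Sum of squared distances of each task's weights to its cluster mean."""
    total = 0.0
    for j in range(k):
        Wj = W[:, labels == j]
        total += ((Wj - Wj.mean(axis=1, keepdims=True)) ** 2).sum()
    return total

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5))            # D = 4 features, T = 5 tasks
labels = np.array([0, 0, 1, 1, 1])     # hypothetical assignment into k = 2 clusters
F = cluster_indicator(labels, 2)

pen = clustering_penalty(W, F)
scatter = within_cluster_scatter(W, labels, 2)
```

The two quantities agree up to floating-point error, so minimizing the penalty over $\boldsymbol{W}$ pulls the weight vectors of tasks in the same cluster towards their cluster mean.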

(40)

\begin{aligned}\min\limits_{\boldsymbol{W}}\quad&\frac{1}{2}\sum_{t=1}^{T}\frac{1}{N_{t}}\|{\boldsymbol{X}^{(t)}}\sum_{h=1}^{H}\boldsymbol{w}_{h}^{t}-\boldsymbol{y}^{(t)}\|_{2}^{2}+\sum_{h=1}^{H}\lambda_{h}\sum_{i<j}^{T}\|\boldsymbol{w}_{h}^{i}-\boldsymbol{w}_{h}^{j}\|^{2}_{2},\\ \text{s.t.}\quad&|\boldsymbol{w}_{h-1}^{i}-\boldsymbol{w}_{h-1}^{j}|\succeq|\boldsymbol{w}_{h}^{i}-\boldsymbol{w}_{h}^{j}|,~\forall h\geq 2,~\forall i<j,\end{aligned}

Notation	Description
$b,B$	Batch size.
$lr$	Learning rate.
${\mathcal{X}}_{l}^{t}\in\mathbb{R}^{(B\times)H\times W\times C}$	Feature maps output from $l$ -th layer of $t$ -th task, where $(B,)H,W,C$ are (batch size,) #height, #width, and #channel.
${\mathcal{W}}\in\mathbb{R}^{S\times S\times C_{\text{in}}\times C_{\text{out}}}$	Convolution filter, where $S$ denotes the size of filter, and $C_{\text{in}},C_{\text{out}}$ denote the number of input and output channels, respectively.
$\text{exp}(\cdot)$	Exponential function.
$\sigma(\cdot)$	Sigmoid function, where $\sigma(x)=1/(1+\text{exp}(-x))$ .
$\text{softmax}(\cdot)$	Softmax function, where $[\text{softmax}(\boldsymbol{x})]_{j}=\text{exp}(x_{j})/\sum_{i}\text{exp}(x_{i})$ for any entry index $j$ .
$\text{sim}(\cdot,\cdot)$	An arbitrary similarity function, e.g. cosine similarity cos $(\cdot,\cdot)$ .
$\odot$	The element-wise product.
$LN(\cdot)$	Layer norm.
$MHSA(q,k,v)$	Multi-head self-attention operator.
$CONV_{{\mathcal{W}}}(\cdot)$	Convolution operation parametrized by ${\mathcal{W}}$ .
$RESHAPE(\cdot)$	Reshape operation to rearrange the original feature maps in $\mathbb{R}^{H\times W\times C}$ space into a new $\mathbb{R}^{HW\times C}$ space.

(66)

\begin{aligned}[RESHAPE(\mathcal{X}^{(t)})]_{j}=~&[RESHAPE(\mathcal{X}^{(t)})]_{j}+\sum\nolimits_{s\neq t}\sum\nolimits_{k\in\mathcal{N}(v_{j})}\beta_{s\rightarrow t}\boldsymbol{A}_{j,k}^{s\rightarrow t}\times[RESHAPE(\mathcal{X}^{(t)})]_{k},\\ \text{s.t.}\quad&\boldsymbol{A}_{P_{i}}^{s\rightarrow t}=\boldsymbol{A}_{P_{i}}^{(t)}\odot\boldsymbol{A}_{P_{i}}^{(s)}/[\boldsymbol{1}^{\top}(\boldsymbol{A}_{P_{i}}^{(t)}\odot\boldsymbol{A}_{P_{i}}^{(s)})\boldsymbol{1}],\quad s,t=1,\cdots,T,\end{aligned}

(68)

\begin{aligned}&\boldsymbol{Q}=RESHAPE(CONV_{{\mathcal{W}}_{q}}(\mathcal{X}_{t})),\quad\boldsymbol{K}=RESHAPE(CONV_{{\mathcal{W}}_{k}}(\mathcal{X}_{s})),\\ &\boldsymbol{V}=RESHAPE(CONV_{{\mathcal{W}}_{v}}(\mathcal{X}_{s})),\end{aligned}
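Eq. (68) only defines how the query, key, and value matrices are obtained: the query comes from the target task $t$, while the key and value come from the source task $s$. The sketch below shows one way these could be combined. It assumes $1\times 1$ convolution filters (so each convolution reduces to a per-pixel matrix product), a single head, and scaled dot-product attention; these choices and all function names are illustrative assumptions, not the formulation of a specific surveyed method.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(X, Wc):
    """1x1 convolution: X is (H, W, C_in), Wc is (C_in, C_out)."""
    return X @ Wc

def reshape(X):
    """RESHAPE: (H, W, C) feature maps to an (HW, C) matrix."""
    H, W_, C = X.shape
    return X.reshape(H * W_, C)

def cross_task_attention(X_t, X_s, Wq, Wk, Wv):
    """Attend from target-task features X_t to source-task features X_s."""
    Q = reshape(conv1x1(X_t, Wq))      # queries from the target task
    K = reshape(conv1x1(X_s, Wk))      # keys from the source task
    V = reshape(conv1x1(X_s, Wv))      # values from the source task
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))   # (HW, HW) attention weights
    out = (A @ V).reshape(X_t.shape[0], X_t.shape[1], -1)
    return out, A

rng = np.random.default_rng(0)
H, W_, C, C_qk = 4, 5, 8, 6            # hypothetical sizes
X_t = rng.normal(size=(H, W_, C))
X_s = rng.normal(size=(H, W_, C))
Wq, Wk = rng.normal(size=(C, C_qk)), rng.normal(size=(C, C_qk))
Wv = rng.normal(size=(C, C))

out, A = cross_task_attention(X_t, X_s, Wq, Wk, Wv)
```

Each row of `A` is a distribution over source-task positions, so every target position receives a convex combination of source-task values.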

	$\displaystyle\left(\sum_{s=1}^{T}\alpha^{(s)}\triangledown{\mathcal{L}}^{(s)}(\boldsymbol{W})\right)^{\top}\frac{\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})}{\left\|\triangledown{\mathcal{L}}^{(t)}(\boldsymbol{W})\right\|}=\left(\sum_{s=1}^{T}\alpha^{(s)}\triangledown{\mathcal{L}}^{(s)}(\boldsymbol{W})\right)^{\top}\frac{\triangledown{\mathcal{L}}^{(1)}(\boldsymbol{W})}{\left\|\triangledown{\mathcal{L}}^{(1)}(\boldsymbol{W})\right\|},\text{ for }t\in\{2,\cdots,T\},$
	$\displaystyle\sum_{t=1}^{T}\alpha^{(t)}=1.$
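
The balance condition above is linear in the weights $\alpha^{(t)}$: it asks that the aggregated gradient have the same projection onto every unit task gradient, subject to the weights summing to one. One way to obtain such weights is therefore to solve a $T\times T$ linear system directly. The sketch below does this in NumPy. The function name `balanced_gradient_weights` is hypothetical, the task gradients are assumed to be given as rows of a matrix, and the system is assumed to be non-singular (which fails, for instance, when the unit gradients coincide).

```python
import numpy as np

def balanced_gradient_weights(grads):
    """Solve for alpha so that sum_s alpha_s g_s projects equally on every g_t/||g_t||."""
    grads = np.asarray(grads, dtype=float)           # shape (T, d), row t is g_t
    num_tasks = grads.shape[0]
    units = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    # Row t-2 encodes (sum_s alpha_s g_s)^T (u_t - u_1) = 0 for t = 2..T.
    balance = (units[1:] - units[0]) @ grads.T       # shape (T-1, T)
    # Last row encodes sum_t alpha_t = 1.
    system = np.vstack([balance, np.ones((1, num_tasks))])
    rhs = np.zeros(num_tasks)
    rhs[-1] = 1.0
    return np.linalg.solve(system, rhs)

# Two orthogonal toy gradients with norms 3 and 1 (assumed for illustration).
grads = np.array([[3.0, 0.0], [0.0, 1.0]])
alpha = balanced_gradient_weights(grads)
combined = alpha @ grads
projections = (grads / np.linalg.norm(grads, axis=1, keepdims=True)) @ combined
```

In this toy case the task with the larger gradient norm receives the smaller weight, so neither task dominates the shared update.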

Dataset	Source	Year	Modality	Task	Synopsis	#Task	#Sample	Availability
School Data	ILEA	mortimore1988school	Table	Regression	Predicting student exam scores based on 27 school features.	139	15,362	Official
SARCOS Data	Humanoid Robotics	2000	Table	Regression	Estimate inverse dynamics model.	7	44,484/4,449	Official
Computer Survey Data	Survey	lenk1996hierarchical	Table	Regression	Likelihood of purchasing personal computers.	179	-	-
Climate Dataset	Sensor network	2017-now	Table	Regression	Real-time climate data collected from four climate stations.	7	-	Official
20 Newsgroups	Netnews articles	Lang95	Text	Classification	Hierarchical text classification.	20	19,000	Official
Reuters-21578 Collection	Reuters	1996	Text	Classification	Reuters news documents with hierarchical categories.	90	21,578	Official
MultiMNIST Dataset	MNIST	sabour2017dynamic	Image	Classification	Classify the digits on the different positions.	2	-	Official
ImageCLEF-2014	Caltech, ImageNet, Pascal, Bing	2014	Image	Classification	Benchmark dataset for domain adaptation.	4	2,400	Official
Office-Caltech Dataset	Office, Caltech	gong2012geodesic	Image	Classification	Benchmark dataset for the annotation and retrieval of images.	4	2,533	Official
Office-31 Dataset	Amazon, DSLR, Webcam	saenko2010adapting	Image	Classification	Objects commonly encountered in office settings.	3	4,110	Official
Office-Home Dataset	Office	venkateswara2017deep	Image	Classification	Object recognition and domain adaptation in the era of deep learning.	4	15,588	Official
DomainNet Dataset	UDA	peng2019moment	Image	Classification	Multi-source unsupervised domain adaptation research	6	600,000	Official
EMMa Dataset	Amazon	standley2023extensible	Image, Text	Classification	Amazon product listings for category prediction	-	2,800,000	Official
SYNTHIA Dataset	European Union	ros2016synthia	Image	Classification	A synthetic dataset for semantic segmentation.	-	13,400	Official
SVHN Dataset	Stanford	yang2021few	Image	Classification	A digit classification benchmark dataset.	-	600,000	Official
CelebA Dataset	MMLAB	liu2018large	Image	Classification	A large-scale face attributes dataset.	40	200,000	Official
CityScapes Dataset	Daimler AG	cordts2016cityscapes	Image	Dense prediction	Semantic urban scene understanding	-	5,000	Official
NYU-Depth Dataset V2	New York University	silberman2012indoor	Image	Dense prediction	Indoor scene understanding with per-pixel labels	3	35,064	Official
PASCAL VOC Project	University of Oxford	everingham2010pascal	Image	Dense prediction	Object recognition with multiple tasks	-	-	Official
Taskonomy Dataset	Stanford	zamir2018taskonomy	Image	Dense prediction	Diverse dataset with 26 tasks for task transfer learning	26	4,000,000	Official
STREET	Amazon	ribeiro2023street	Text	Reasoning	The multi-task structured reasoning and explanation benchmark	-	-	-
VKITTI2 Dataset	Naver	cabon2020virtual	Video	Segmentation	A video dataset which is automatically labeled with ground truth	5	-	Official
XTREME	Carnegie Mellon	hu2020xtreme	Text	Translation, QA	A multilingual benchmark for evaluating cross-lingual generalisation	9	400,000	-
Deepfashion Dataset	Shopping Websites	liu2016deepfashion	Image	Classification	A large-scale clothes dataset with comprehensive annotations	2	800,000	Official
ACE05 Dataset	News	2005	Text	Classification	A large corpus with annotated entities, relations and events	3	52,615	Official
ATIS Dataset	Airline	hemphill-etal-1990-atis	Text	Classification	A dataset with 17 unique intent categories.	3	5,871	Official

Library	Language	Supported Methods
RMTL	R	Sparse structure learning (tibshirani1996regression), multi-task feature selection (obozinski2006multi), low rank MTL (ji2009accelerated; pong2010trace), graph-based regularised MTL (widmer2010leveraging), multi-task clustering (gu2009learning)
MALSAR	Matlab	Sparse structure learning (tibshirani1996regression), regularized MTL (evgeniou2004regularized), multi-task feature selection (obozinski2006multi), dirty block-sparse model (jalali2010dirty), low rank MTL (ji2009accelerated; pong2010trace), convex ASO (chen2009convex), sparse & low rank MTL (chen2012learning), clustered MTL (zhou2011clustered), robust MTL (chen2011integrating), robust multi-task feature learning (gong2012robust), Temporal group Lasso (zhou2011multi), convex fused sparse group Lasso (zhou2012modeling), incomplete multi-source feature learning (yuan2012multi), multi-stage multi-task feature learning (gong2012multi), multi-task clustering (gu2009learning)
LibMTL	Python	Cross-stitch (misra2016cross), GradNorm (chen2018gradnorm), Uncertainty Weighting (kendall2018multi), MGDA-MTL (sener2018multi), MMoE (ma2018modeling), MultiNet++ (chennupati2019multinet++), LTB (guo2020learning), MTAN & DWA (liu2019end), PCGrad (yu2020gradient), GradDrop (chen2020just), CGC & PLE (tang2020progressive), IMTL (liu2021towards), GradVac (wang2021gradient), CAGrad (liu2021conflictaverse), DSelect-k (hazimeh2021dselect), RLW & RGW (lin2022reasonable), Nash-MTL (navon2022multi)

1. Introduction

1.1. Definition

1.1.1. Formal Definition

Definition 1 (Machine Learning, mitchell1997machine).

Definition 2 (Single-Task Learning).

Definition 3 (Multi-Task Learning).

1.2. Related Fields

Transfer Learning (TL)

Few-Shot Learning (FSL)

Meta-Learning

Lifelong Learning

Multi-View Learning (MVL)

In-Context Learning (ICL)

1.3. Motivation and Benefit

1.4. Mechanism and Explanation

Regularization

Inductive Bias

Feature Sharing

1.5. Contributions and Highlights

2. MTL Models

Formalization

I/O Configurations

Taxonomy

Assumption 1 (Parameter Relatedness).

Assumption 2 (Feature Richness).

Assumption 3 (Optimization Consistency).

2.1. Traditional Era: Provable but Restrictive

2.1.1. Feature Selection

Block-Wise Sparsity

Element-Wise Sparsity

2.1.2. Feature Transformation

2.1.3. Low-Rank Factorization

Matrix Factorization

Tensor Factorization

2.1.4. Decomposition

Form “$\boldsymbol{P}+\boldsymbol{Q}$”

Form “$\boldsymbol{P}\cdot\boldsymbol{Q}$”

2.1.5. Priori Sharing

2.1.6. Task Clustering/Grouping

Horizontal Methods

Hierarchical Methods

2.2. DL Era: Effective and Diversified

Architecture Taxonomy

2.2.1. Feature fusion

2.2.2. Cascading

2.2.3. Knowledge Distillation (KD)

2.2.4. Cross-Task Attention

2.2.5. Scalarization Approach

2.2.6. Multi-objective Optimization (MOO)

Definition 4.

2.2.7. Adversarial training

2.2.8. Mixture of Experts (MoE)

2.2.9. Graph based

2.2.10. Neural Architecture Search (NAS)

2.3. Foundation Model Era: Towards Unified and Versatile

2.3.1. Downstream Task Fine-Tuning

2.3.2. Task Prompting

2.3.3. Unified Generalist Models

3. Miscellaneous

3.1. Fairness and Bias in MTL

3.2. Security and Privacy in MTL

3.3. Distribution Shifts in MTL

3.4. Non-supervised MTL

3.5. Others

3.5.1. Applications of MTL

3.5.2. MTL+X

4. Resources

4.1. Dataset

4.1.1. Regression task

4.1.2. Classification task

4.1.3. Dense prediction task

4.1.4. Others

4.2. Software Resources

4.3. Evaluation Metric

4.3.1. Single-task Metric

Regression Task Metric

Classification Task Metric

Object Detection Task Metric

$\Delta_{m}$ (maninis2019attentive)

$\Delta_{p}$ (lin2022reasonable)