Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras
Abstract.
Multi-Task Learning (MTL) is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to Single-Task Learning (STL), MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL’s key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including computer vision, natural language processing, recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and task-agnostic training, along with the capacity for zero-shot learning, which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner.
This project is publicly available at https://github.com/junfish/Awesome-Multitask-Learning.
Jun Yu (1, 2), Yutong Dai (3), Xiaokang Liu (2, 4), Jin Huang (5), Yishan Shen (2), Ke Zhang (6), Rong Zhou (1), Eashan Adhikarla (1), Wenxuan Ye (1), Yixin Liu (1), Zhaoming Kong (7), Kai Zhang (1), Yilong Yin (5), Vinod Namboodiri (1, 8), Brian D. Davison (1), Jason H. Moore (9), Yong Chen (2)
(1) Department of Computer Science and Engineering, Lehigh University, USA
(2) Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, USA
(3) Department of Industrial and Systems Engineering, Lehigh University, USA
(4) Department of Statistics, University of Missouri, USA
(5) School of Software, Shandong University, China
(6) Department of Computer Science, University of Hong Kong, China
(7) Department of Computer Science and Engineering, South China University of Technology, China
(8) Department of Community and Population Health, Lehigh University, USA
(9) Department of Computational Biomedicine, Cedars-Sinai Medical Center, USA
This work includes efforts as a visiting student at UPenn.
Corresponding to [email protected] or [email protected].
Keywords: Deep Learning, Generative Pretrained Transformers, Multi-Objective Optimization, Multi-Task Learning, Pretrained Foundation Models, Prompt Learning
1. Introduction
In the introduction, we aim to answer the following five research questions (RQs) before reviewing the methodologies of Multi-Task Learning (MTL):
• RQ1: What is the concept and definition of MTL? (See § 1.1)
• RQ2: How does MTL distinguish itself from other learning paradigms? (See § 1.2)
• RQ3: What motivates the use of MTL in learning scenarios? (See § 1.3)
• RQ4: What underlying principles does the efficacy of MTL rest on? (See § 1.4)
• RQ5: In what ways does our survey differ from previous studies? (See § 1.5)
In § 1.1, we progressively introduce Multi-Task Learning (MTL), starting with a broad sense and culminating in a formal definition. Subsequently, § 1.2 explores the position of MTL within the Machine Learning (ML) landscape, drawing comparisons with related paradigms such as Transfer Learning (TL), Few-Shot Learning (FSL), lifelong learning, Multi-View Learning (MVL), to name a few. § 1.3 delves into the motivations for employing MTL, offering insights from both explicit and subtle angles, while also addressing how MTL benefits the involved tasks. In § 1.4, we delve deeper into the fundamental mechanisms and theories underpinning MTL, specifically: 1) regularization, 2) inductive bias, and 3) feature sharing, providing an understanding of its underlying principles. Finally, § 1.5 reviews existing surveys on MTL, underscoring the unique contributions of our survey and laying out a structured roadmap for the remainder of this work. The structure of our survey is depicted in Fig. 2. Before delving into this survey, readers can quickly refer to Table 1 for a list of acronyms not related to datasets, institutions, and newly proposed methods, while an overview of mathematical notations is provided in Table 3 and Table 6.
Abbreviation | Expanded Form | Abbreviation | Expanded Form |
AD | Alzheimer’s Disease | AGM | Accelerated Gradient Method |
APM | Accelerated Proximal Method | CE | Cross-Entropy |
CNN | Convolutional Neural Network | CT | Computed Tomography |
CV | Computer Vision | DA | Domain Adaptation |
DL | Deep Learning | DNN | Deep Neural Network |
FCN | Fully Convolutional Network | FNN | Feedforward Neural Network |
FSL | Few-Shot Learning | GAN | Generative Adversarial Network |
GCN | Graph Convolutional Network | GNN | Graph Neural Network |
GP | Gaussian Process | GPT | Generative Pretrained Transformer |
GPU | Graphics Processing Unit | GRL | Gradient Reversal Layer |
I/O | Input/Output | KD | Knowledge Distillation |
LLM | Large Language Model | LSTM | Long Short-Term Memory |
MAP | Maximum A Posteriori | MCI | Mild Cognitive Impairment |
MDP | Markov Decision Process | MIM | Masked Image Modeling |
MIML | Multi-Instance Multi-Label learning | MIMO | Multi-Input Multi-Output |
MISO | Multi-Input Single-Output | ML | Machine Learning |
MLM | Masked Language Modeling | MLP | Multi-Layer Perceptron |
MoE | Mixture-of-Experts | MOO | Multi-Objective Optimization |
MRI | Magnetic Resonance Imaging | MSE | Mean Squared Error |
MTL | Multi-Task Learning | MTRL | Multi-Task Reinforcement Learning |
MVL | Multi-View Learning | NAS | Neural Architecture Search |
NLI | Natural Language Inference | NLP | Natural Language Processing |
OCR | Optical Character Recognition | OOD | Out-Of-Distribution |
PET | Positron Emission Tomography | PFM | Pretrained Foundation Model |
PSD | Positive Semi-Definite | RL | Reinforcement Learning |
RNN | Recurrent Neural Network | seq2seq | sequence to sequence |
SIMO | Single-Input Multi-Output | SNP | Single Nucleotide Polymorphism |
SGD | Stochastic Gradient Descent | SSL | Self-Supervised Learning |
SOTA | State-Of-The-Art | STL | Single-Task Learning |
SVD | Singular Value Decomposition | SVM | Support Vector Machine |
TL | Transfer Learning | TPU | Tensor Processing Unit |
VLM | Vision-Language Model | VQA | Visual Question Answering |
ZSL | Zero-Shot Learning |
This table excludes abbreviations pertaining to datasets, institutions, and newly proposed methods.
1.1. Definition
The increasing popularity of MTL over the past few decades is evident in Fig. 3, which displays the trend in the number of papers associated with “allintitle: ‘multitask learning’ OR ‘multi-task learning’ ” as a keyword search, according to data from Google Scholar (https://scholar.google.com).
As the name suggests, MTL is a subfield of ML in which multiple tasks are learned jointly. The aim is to leverage useful information across these related tasks and to break from the tradition of performing different tasks in isolation. In Single-Task Learning (STL), data specific to the task at hand is the only source for training a learner, whereas MTL can conveniently transfer extra knowledge learned from other tasks. The essence of MTL is to exploit consensual and complementary information among tasks by combining data resources and sharing knowledge. This points to a better learning paradigm that can reduce memory burden and data consumption while improving training speed and test performance. For instance, learning monocular depth estimation (estimating the distance to the camera) (eigen2014depth) and semantic segmentation (assigning a class label to every pixel) (fu1981survey) simultaneously on images is beneficial, since both tasks need to perceive meaningful objects. MTL has become increasingly ubiquitous as experimental and theoretical analyses continue to validate its promising results. For example, using Face ID to unlock an iPhone is a typical but imperceptible MTL application that involves simultaneously locating the user’s face and identifying the user. In general, multitasking occurs in practice whenever we attempt to handle two or more objectives during the optimization stage.
Consequently, MTL exists everywhere in ML, even when performing STL with regularization. This can be understood as having one target task and an additional artificial task of human preference, such as learning a constrained model via an $\ell_2$ regularizer or a parsimonious model via an $\ell_1$ regularizer. These hypothesis preferences can serve as an inductive bias to enhance an inductive learner (caruna1993multitask). In the early exploration of MTL (caruana1997multitask), the extra information provided by the involved tasks is regarded as a domain-specific inductive bias for the other tasks. Since collecting training signals from other tasks is more practical than acquiring inductive bias from model design or human expertise, we can thus empower any ML model via this MTL paradigm.
1.1.1. Formal Definition
To comprehensively understand MTL, we provide a formal definition. Suppose we have a sample dataset $\mathbf{X}$ drawn from the feature space $\mathcal{X}$, and its respective ground-truth label set $\mathbf{Y}$ drawn from the label space $\mathcal{Y}$. We can define experience $\mathcal{E} = \{\mathbf{X}, \mathbf{Y}\}$, domain $\mathcal{D} = \{\mathcal{X}, P(\mathbf{X})\}$, and task $\mathcal{T} = \{\mathcal{Y}, f\}$, where $P(\mathbf{X})$ is the distribution of $\mathbf{X}$ and $f$ maps a data sample $\mathbf{x}$ to a prediction $\hat{y}$. These predictive values constitute the predictive label set $\hat{\mathbf{Y}}$. Following the ML settings, we should define a measurement $\mathcal{P} = \ell(\mathbf{Y}, \hat{\mathbf{Y}})$, where $\ell$ is a function to measure the distance between any pair $(y, \hat{y})$. For more basic notation, please refer to Table 3. Based on the definitions of the four basic elements above (experience, domain, task, and measurement), we first restate the general definition of machine learning by mitchell1997machine in a more exact form as follows.
Definition 1 (Machine Learning, mitchell1997machine).
A computer program is said to learn from experience $\mathcal{E}$ with respect to a set of tasks $\mathbb{T}$ and performance measurement $\mathcal{P}$, if its performance at tasks $\mathbb{T}$, as measured by $\mathcal{P}$, improves with experience $\mathcal{E}$.
The definition above inherently covers both single-task and multi-task scenarios, but it is not precise enough to characterize MTL, including its recent developments. Let us therefore first define STL and then derive the formal definition of MTL.
Definition 2 (Single-Task Learning).
A type of machine learning specified by $\mathbb{T}$ and $\mathcal{E}$, where $\mathbb{T}$ contains only one task (i.e., $|\mathbb{T}| = 1$) on a specific domain $\mathcal{D}$.
As recent developments in MTL focus more on heterogeneous tasks (e.g., regression and classification) than homogeneous ones, each task $\mathcal{T}_t$ should be represented by its own experience $\mathcal{E}_t$ on its corresponding domain $\mathcal{D}_t$. Due to this diversity, we always employ a distinct measurement $\mathcal{P}_t$ to evaluate the learning performance of each task. We accordingly define MTL as follows.
Definition 3 (Multi-Task Learning).
A superset of STL specified by $\mathbb{T} = \{\mathcal{T}_t\}_{t=1}^{T}$ and $\{\mathcal{E}_t\}_{t=1}^{T}$, where experience $\mathcal{E}_t$ is with respect to task $\mathcal{T}_t$ on its corresponding domain $\mathcal{D}_t$. Accordingly, MTL is a computer program that learns from the experience set $\{\mathcal{E}_t\}_{t=1}^{T}$ with respect to the task set $\mathbb{T}$ and the corresponding performance measurement set $\{\mathcal{P}_t\}_{t=1}^{T}$, if its total performance at any task $\mathcal{T}_t$, as measured by its corresponding $\mathcal{P}_t$, $t = 1, \dots, T$, improves with the experience set $\{\mathcal{E}_t\}_{t=1}^{T}$.
We note that the formal MTL definition above has no conflict with the homogeneous or heterogeneous MTL.
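As a minimal illustration of Definition 3, the sketch below gives each task its own experience and its own measurement, and treats STL as the special case of a task set with a single element. The class `Task`, the helpers `evaluate` and `is_single_task`, and the toy data are hypothetical names introduced here for illustration only; they are not part of the formal definition.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical containers; all names are illustrative, not from the survey.
Example = Tuple[float, float]  # (input x, ground-truth label y)


@dataclass
class Task:
    name: str
    experience: List[Example]  # data available for this task
    measurement: Callable[[Sequence[float], Sequence[float]], float]  # lower is better


def mse(y_true, y_pred):
    """Mean squared error, a typical regression measurement."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    """Fraction of wrong labels, a typical classification measurement."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)


def is_single_task(tasks):
    """STL is the special case in which the task set holds exactly one task."""
    return len(tasks) == 1


def evaluate(tasks, predictors):
    """Score every task with its own measurement on its own experience."""
    scores = {}
    for task in tasks:
        xs = [x for x, _ in task.experience]
        ys = [y for _, y in task.experience]
        preds = [predictors[task.name](x) for x in xs]
        scores[task.name] = task.measurement(ys, preds)
    return scores


# Heterogeneous toy tasks: one regression, one binary classification.
tasks = [
    Task("regression", [(1.0, 2.0), (2.0, 4.0)], mse),
    Task("classification", [(-1.0, 0), (3.0, 1)], error_rate),
]
predictors = {
    "regression": lambda x: 2.0 * x,
    "classification": lambda x: int(x > 0),
}
scores = evaluate(tasks, predictors)
```

Keeping the measurement inside each task mirrors the point that heterogeneous tasks are evaluated with distinct measurements rather than a single global one.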
1.2. Related Fields
Having established a formal definition of MTL grounded in fundamental ML elements, a thorough understanding can be achieved by analytically comparing it with related domains. These include Transfer Learning (TL), Meta-Learning, and In-Context Learning (ICL), among others. This comparison not only clarifies the distinct characteristics of MTL but also situates it within the broader context of these interconnected fields.
Transfer Learning (TL)
TL (pan2009survey) is a prevalent learning paradigm that solves the problem of lacking labeled data when applying ML to real-world data (zhuang2020comprehensive; pan2009survey). Specifically, TL improves the performance of a target model on target domains by transferring the knowledge in different but related source domains to the target domains.
Such properties make TL well-appreciated in real-world applications, such as healthcare (kao2021toward; song2021transfer; perez2021transfer) and recommender systems (tl_recom_www21; liu2021leveraging; tl_recom_cikm21).
According to the availability of labels in the source and target domains, TL is categorized into three types, i.e., transductive TL (aka Domain Adaptation (DA), redko2019advances; patel2015visual), inductive TL, and unsupervised TL (zhuang2020comprehensive; pan2009survey).
Few-Shot Learning (FSL)
FSL (fink2004object; fei2006one; wang2020generalizing) is a specific application case of TL. It aims at obtaining a model for the target task under a certain scenario where limited labeled samples from the target domain are available (wang2020generalizing). FSL is well-acknowledged in tackling different real-world problems such as identifying atypical ailments (quellec2020automatic; jia2020few), visual navigation (al2022zero; luo2021few), and cold-start item recommendation (sun2021mfnp; zhang2021model).
Meta-Learning
Meta-Learning (hospedales2021meta) is an implementation approach to achieve TL. The main concept is to obtain a meta-learner (a model) that can have satisfying performance for an unseen target domain (hospedales2021meta). Such meta-learner first extracts the meta-knowledge, i.e., the universally applicable principles, across source domains. With meta-knowledge, the meta-learner can be easily generalized to the target domain by leveraging the target samples. Meta-learning has been successfully applied in various problems such as hyper-parameter optimization (bohdal2021evograd; raghu2021meta), algorithm selection for data mining (simchowitz2021bayesian), and neural architecture search (NAS) (lee2021hardware; ding2022learning).
Although TL paradigms, including FSL and meta-learning, involve multi-domain data, their ultimate goal is to obtain a model that performs satisfactorily on, or can be easily generalized to, one target task. In other words, TL leverages the knowledge in different tasks to assist the model in learning a single task, which intersects with MTL according to Definition 3. Thus, TL can bring merits to MTL, such as capturing the relations among tasks and extracting shared knowledge among involved tasks. Notably, in recent advancements, the transfer of knowledge from pretrained foundation models (PFMs) has proven beneficial for a myriad of downstream tasks (bommasani2021opportunities; zhou2023comprehensive).
Lifelong Learning
Lifelong Learning (parisi2019continual), aka Continual Learning, Sequential Learning, or Incremental Learning, studies the problem of learning from an infinite stream of data (de2021continual). The goal is to gradually extend the acquired knowledge and use it for future data, mitigating the occurrence of catastrophic forgetting or interference (mcclelland1995there). With only a small portion of the input data from one or few tasks available at once, lifelong learning particularly tends to preserve the knowledge learned from the previous input when learning on new data, i.e., addressing the stability-plasticity dilemma (grossberg2012studies). There are extensive applications of lifelong learning in solving tasks in ever-evolving systems, such as recommendations (chen2021towards; yao2021device) and anomaly detection (peng2021lime; doshi2022rethinking).
Lifelong learning differs from MTL in the sense that its training object is a dynamic data stream, while MTL studies data from multiple tasks available at the beginning of the learning process.
Multi-View Learning (MVL)
MVL (xu2013survey; zhao2017multi; li2018survey) studies the problem of jointly learning from multi-view data samples, with the goal of optimizing the generalization performance of the jointly learned model (li2018survey). In real-world applications, multi-view data refers to objects described by multi-modal measurements, such as image+text, audio+video, and audio+articulation. Multi-Instance Multi-Label learning (MIML) (zhou2012multi) is a specific subtype of MVL, where an example is described by multiple instances and associated with multiple class labels. Because multi-view data is so widespread in real-world settings, MVL has attracted much attention in both research and industry, and the respective solutions play essential roles in cross-media retrieval (zhen2019deep; huang2020forward), video analysis (wang2022cascade; zellers2021merlot), recommender systems (wei2022contrastive; chai2022knowledge), etc. MVL, including MIML, can be considered a specialized form of MTL, where the input contains data from multiple domains that are handled as distinct tasks, but the output is still in one label space.
In-Context Learning (ICL)
ICL (dong2022survey) has aroused interest as a novel learning paradigm for natural language processing (NLP) within Large Language Models (LLMs). ICL relies on templates in natural language that can demonstrate different tasks, such as solving mathematical reasoning problems (wei2022chain) and learning natural language inference (NLI) (liu2021natural). LLMs can then make predictions by taking this demonstration and its corresponding query pair as input. While both ICL and MTL involve leveraging shared knowledge or context to enhance task generalizability, ICL is specifically tailored to the target task within a narrower scope in real-world applications. However, recent large PFMs, like GPT-4 (openai2023gpt4), are inherently task-agnostic, accommodating various tasks owing to the diversity of demonstration templates encountered during their large-scale training stage.
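The way a demonstration and a query are combined into a single model input can be sketched as plain string assembly. The function `build_icl_prompt` and the `Q:`/`A:` template below are assumptions chosen for illustration; actual templates vary across models and tasks, and no particular LLM API is implied.

```python
def build_icl_prompt(demonstrations, query, instruction=""):
    """Assemble an in-context learning prompt.

    `demonstrations` is a list of (question, answer) pairs that show the task;
    `query` is the new question the model is asked to complete.
    The "Q:"/"A:" layout is a hypothetical template, not a standard.
    """
    parts = []
    if instruction:
        parts.append(instruction)
    for question, answer in demonstrations:
        parts.append(f"Q: {question}\nA: {answer}")
    # The final answer slot is left open for the model to fill in.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


demos = [("2 + 3 = ?", "5"), ("7 - 4 = ?", "3")]
prompt = build_icl_prompt(demos, "6 + 1 = ?")
```

Swapping the demonstrations changes the task without changing any model weights, which is what distinguishes this setting from the joint training used in MTL.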
1.3. Motivation and Benefit
MTL can be motivated from the following five perspectives with different benefits: cognitive/social psychology, data augmentation, learning efficacy, real-world scenarios, and learning theory.
• Psychologically, humans are inherently adaptable to new problems and settings, as the human learning process can transfer knowledge from one experience to another (national2000people). MTL is therefore inspired by simulating this process to give a model the potential for multitasking. A similar knowledge transfer happens among organizations (argote2000knowledge): organizations with more effective knowledge transfer have been shown to be more productive and more likely to survive than those with less. These prior successes of transfer or mutualization in other areas encourage the joint learning of tasks in ML (caruana1997multitask).
• In the pre-big data era, real-world problems were usually represented by small but high-dimensional datasets (number of samples $\ll$ number of features). This data bottleneck forced early methods to learn a sparse-structured model, which always leads to a parsimonious solution to a problem with insufficient data. MTL, in contrast, emerged to aggregate labeled data from different domains or tasks, enlarging the training dataset to guard against overfitting.
• The pursuit of efficiency and effectiveness is another motivation. MTL can aggregate data from different sources, and the joint training process of multiple tasks can save both computation and storage resources. In addition, the potential for performance enhancement makes it popular in research communities. In brief, universal representations for any task can be learned from multi-source data and benefit all tasks in terms of both learning cost and performance.
• Because the majority of real-world problems are naturally multimodal or multitasking, MTL is proposed to remedy the suboptimal results achieved by STL, which models only parts of the whole problem separately. For example, predicting the progression of Alzheimer’s Disease (AD) biomarkers for Mild Cognitive Impairment (MCI) risk and clinical diagnosis is simultaneously based on multimodal data such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) (jie2015manifold; kwak2018multi; chen2022machine). Autonomous driving, another example, also involves multiple subtasks to calculate the final prediction (yang2018end; chowdhuri2019multinet), including the recognition of surrounding objects, adjustments to the fastest route according to the traffic conditions, the balance between efficiency and safety, etc.
• From the perspective of learning theory, bias-free learning has been proved to be impossible (mitchell1980need), so MTL can be motivated as a way to use the extra training signals of related tasks. Generally, MTL is one of the ways to achieve inductive transfer via multitasking assistance, which improves both learning speed and generalization. Specifically, during the combined training of multiple tasks, some tasks receive inductive bias from other related tasks, and these stronger inductive biases (compared with universal regularizers, e.g., the $\ell_2$ norm) enable knowledge transfer and yield better generalization on a fixed training dataset. In other words, task-related biases make a learner prefer hypotheses that can explain more than one task and prevent any specific task from overfitting.
1.4. Mechanism and Explanation
In this section, we explore three key mechanisms – regularization, inductive bias, and feature sharing – shedding light on how MTL operates to achieve enhanced performance across multiple tasks.
Regularization
In MTL, the total loss function is a combination of multiple loss terms with respect to each task. The related tasks play a role as regularizers, enhancing the generalizability across them. The hypothesis space of an MTL model is confined to a more limited scope as it tackles multiple tasks simultaneously. Consequently, this constraint on the hypothesis space reduces model complexity, mitigating the risk of overfitting.
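The regularizing effect of a related task can be seen in a deliberately tiny case. The sketch below assumes a single scalar weight shared by two tasks (hard parameter sharing), a squared-error loss per task, and equal loss weights; the data and the helper `least_squares_slope` are hypothetical. Under these assumptions, minimizing the summed loss is the same as a least-squares fit on the pooled data, so the joint solution is pulled away from the single-task fit toward what both tasks agree on.

```python
def least_squares_slope(data):
    """Closed-form minimizer of sum((y - w * x) ** 2) over a scalar weight w."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / sxx


# Toy data (hypothetical): two related tasks whose true slopes are close to 2.
task_a = [(1.0, 2.5), (2.0, 3.5)]              # few, noisy samples
task_b = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # cleaner samples

w_a = least_squares_slope(task_a)               # task A trained alone
w_b = least_squares_slope(task_b)               # task B trained alone
w_joint = least_squares_slope(task_a + task_b)  # minimizes L_A + L_B with a shared w
```

The joint weight always lies between the two single-task solutions, which is one concrete sense in which the hypothesis space available to each task is narrowed by the other.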
Inductive Bias
The training signals from co-training tasks act as mutual inductive biases due to their shared domain information. These biases facilitate cross-task knowledge transfer during training, guiding the model to favor task-related concepts rather than the tasks themselves. Consequently, this broadens the model’s horizons beyond a singular task, enhancing its generalization capabilities for unseen out-of-distribution (OOD) data.
Feature Sharing
MTL can enable feature sharing across related tasks. One approach involves selecting overlapping features and maximizing their utility across all tasks. This is referred to as “eavesdropping” (ruder2017overview), considering that some features may be unavailable for specific tasks but can be substituted by those learned from related tasks. Another way is to concatenate all the features extracted by different tasks; these features can then be used holistically across tasks via linear combination or nonlinear transformation.
Overall, MTL can be an efficient and effective way to boost the performance of the ML model on multiple tasks by regularization, inductive transfer, and feature sharing.
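The concatenation route to feature sharing can be sketched in a few lines. The feature values, the weights, and the helpers `concat_features` and `linear_head` are hypothetical and chosen only to show the data flow: features produced by different task branches are joined into one vector, and each task then forms its own linear combination over all of them.

```python
def concat_features(*feature_vectors):
    """Concatenate per-task feature vectors into one shared representation."""
    shared = []
    for vec in feature_vectors:
        shared.extend(vec)
    return shared


def linear_head(features, weights, bias=0.0):
    """Task-specific linear combination of the shared features."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Hypothetical features extracted for the same input by two task branches.
seg_features = [0.2, 0.8]
depth_features = [1.5]

shared = concat_features(seg_features, depth_features)

# Each head reads every feature, including those learned for the other task.
seg_score = linear_head(shared, weights=[1.0, 1.0, 0.5])
depth_score = linear_head(shared, weights=[0.0, 0.5, 2.0])
```

A nonlinear transformation would replace `linear_head` with, for example, a small neural network over the same concatenated vector.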
1.5. Contributions and Highlights
Existing Surveys. ruder2017overview is a pioneering survey in MTL, offering a broad overview of MTL and focusing on advances in deep neural networks from 2015 to 2017. thung2018brief reviews MTL methods from a taxonomy perspective of input-output variants, mainly concentrating on traditional MTL prior to 2016. These two reviews complement each other. vafaeikia2020brief is an incomplete survey that briefly reviews recent deep MTL approaches, particularly focusing on the selection of auxiliary tasks for enhanced learning performance. crawshaw2020multi presents the well-established and advanced MTL methods before 2020 from the perspective of applications. vandenhende2021multi provides a comprehensive review of deep MTL in dense prediction tasks, which generate pixel-level predictions such as in semantic segmentation and monocular depth estimation. zhang2021survey first gives a comprehensive overview of MTL models from the taxonomy of feature-based and parameter-based approaches, but with limited inclusion of deep learning (DL) methods. Notably, all these surveys overlook the development of MTL in the last three or four years, known as the era of large PFMs (bommasani2021opportunities; zhou2023comprehensive) and exemplified by the GPT-series models (radford2018improving; radford2019language; brown2020language; openai2023gpt4).
Roadmap. This survey adopts a well-organized structure, distinguishing it from its predecessors, to demonstrate the evolutionary journey of MTL from traditional methods to DL and the innovative paradigm shift introduced by PFMs, as shown in Fig. 1. In § 2.1, we provide a comprehensive summary of traditional MTL techniques, including feature selection, feature transformation, decomposition, low-rank factorization, priori sharing, and task clustering. Moving forward, § 2.2 is devoted to exploring the critical dimensions of deep MTL methodologies, encompassing feature fusion, cascading, knowledge distillation, cross-task attention, scalarization, multi-objective optimization (MOO), adversarial training, Mixture-of-Experts (MoE), graph-based methods, and NAS. The recent advancements in PFMs are introduced in § 2.3, categorized based on task-generalizable fine-tuning, task promptable engineering, as well as task-agnostic unification. Additionally, we provide a concise overview of the miscellaneous aspects of MTL in § 3. § 4 provides valuable resources and tools to enhance the engagement of researchers and practitioners with MTL. Our discussions and future directions are presented in § 5, followed by our conclusion in § 6. The goal of this review is threefold: 1) to provide a comprehensive understanding of MTL for newcomers; 2) to function as a toolbox or handbook for engineering practitioners; and 3) to inspire experts by providing insights into the future directions and potentials of MTL.
2. MTL Models
Formalization
In machine learning, no matter the problem (discriminative, generative, adversarial, etc.), we hope to learn a predictive model by minimizing the regularized empirical loss as
$$\min_{\mathbf{w}} \; \mathcal{L}(\mathbf{X}, \mathbf{Y}; \mathbf{w}) + \lambda\,\Omega(\mathbf{w}), \tag{1}$$
where $(\mathbf{X}, \mathbf{Y})$ denotes data pairs sampled from a single task, and $\mathbf{w}$ includes the weights of the learning model $f$. In general, $\mathcal{L}$ measures the distance between the predictions and the ground truth, and $\Omega$ adds constraints to the learning model, e.g., sparsity. The trade-off parameter $\lambda$ controls the balance between the loss and the penalty. Fig. 4(a) shows the detailed framework of STL. In comparison, as shown in Fig. 4(b), the optimization in MTL is conducted on multiple loss functions to achieve joint learning, and each task can maintain a specific loss function. Accordingly, MTL considers the following problem:
$$\min_{\mathbf{W}} \; \sum_{t=1}^{T} \mathcal{L}_t(\mathbf{X}_t, \mathbf{Y}_t; \mathbf{W}) + \lambda\,\Omega(\mathbf{W}), \tag{2}$$
where $T$ denotes the number of tasks, and $\mathbf{W}$ is the MTL model to be learned. In MTL, $\mathbf{W}$ always encodes both task-specific and task-shared representations, and $\Omega$ builds task relatedness and reciprocity; both contribute to the effectiveness and efficiency of MTL.
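The multi-task objective described above, a sum of per-task losses plus a weighted regularizer that couples the tasks, can be evaluated directly for a small linear model. The sketch below is one possible instance under stated assumptions: each task has its own linear weight vector, the per-task loss is mean squared error, and the regularizer is an $\ell_{2,1}$-style norm that groups each feature's weights across tasks. The function names and the toy data are hypothetical.

```python
import math


def mse(w, data):
    """Mean squared error of a linear model with weights w on (x, y) pairs."""
    total = 0.0
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / len(data)


def l21_norm(W):
    """Sum over features of the l2 norm of that feature's weights across tasks.

    W[t][j] is the weight of feature j for task t.
    """
    n_features = len(W[0])
    return sum(
        math.sqrt(sum(w_t[j] ** 2 for w_t in W)) for j in range(n_features)
    )


def mtl_objective(W, datasets, lam):
    """Sum of per-task losses plus lam times the coupling regularizer."""
    loss = sum(mse(w_t, data_t) for w_t, data_t in zip(W, datasets))
    return loss + lam * l21_norm(W)


# Toy data (hypothetical): both tasks depend only on the first feature.
datasets = [
    [([1.0, 0.0], 3.0), ([0.0, 1.0], 0.0)],  # task 1
    [([1.0, 0.0], 4.0), ([0.0, 1.0], 0.0)],  # task 2
]
W = [[3.0, 0.0], [4.0, 0.0]]  # one weight vector per task
value = mtl_objective(W, datasets, lam=0.1)
```

Here the data-fit term is zero and the penalty is `0.1 * sqrt(3**2 + 4**2)`, so only the regularizer contributes. Choosing a different regularizer changes which kind of task relatedness is encouraged, while the overall structure of the objective stays the same.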
I/O Configurations
To accommodate data in Eq. (2), it is necessary to consider various input/output (I/O) configurations that may impose constraints on the MTL modeling process. For instance, tasks such as semantic segmentation and depth estimation can utilize the same input images, and the applications are always developed using datasets where each image is annotated with dense prediction labels for both segmentation and depth. On the other hand, when dealing with a digit recognition problem involving multiple domains (e.g., handwritten digits and license plate digits), different inputs are mapped to the same output space. We refer to the former as a single-input multi-output (SIMO) configuration and the latter as a multi-input single-output (MISO) configuration. In MTL, the most prevalent scenarios fall under the multi-input multi-output (MIMO) configuration, where each task maintains its own set of samples and the labels are diverse, e.g., autonomous driving that involves pedestrian detection and traffic sign recognition. Let us denote the data input space and its corresponding label space for the $t$-th task by $\mathcal{X}_t$ and $\mathcal{Y}_t$, respectively. We classify MTL problems into three cases: SIMO, MISO, and MIMO. Fig. 5 illustrates these three configurations. It is worth noting that the I/O configurations do not significantly impact the taxonomy of methods in MTL. As indicated in Table 2, there are numerous shared practices of applying different methods to these I/O configurations, as well as to various data modalities and task types.
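The three I/O configurations can be told apart by checking whether the tasks share an input space and whether they share a label space. The helper `io_configuration` below is a hypothetical sketch that takes one (input space, label space) identifier pair per task; as a simplifying assumption, any task set in which neither space is fully shared is treated as MIMO.

```python
def io_configuration(tasks):
    """Classify a task set by which spaces its tasks share.

    `tasks` is a list of (input_space, label_space) identifiers, one per task.
    """
    inputs = {inp for inp, _ in tasks}
    labels = {lab for _, lab in tasks}
    shared_input = len(inputs) == 1
    shared_output = len(labels) == 1
    if shared_input and not shared_output:
        return "SIMO"
    if shared_output and not shared_input:
        return "MISO"
    if not shared_input and not shared_output:
        return "MIMO"
    # Same input space and same label space: no multi-task structure to exploit.
    return "single input and output space"


simo = io_configuration([("street images", "segmentation"),
                         ("street images", "depth")])
miso = io_configuration([("handwritten digits", "digit class"),
                         ("license plate digits", "digit class")])
mimo = io_configuration([("pedestrian crops", "pedestrian boxes"),
                         ("sign crops", "sign class")])
```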
I/O | Data Modality | Task Type | ||||||||||
MTL Strategy | Assumption | SIMO | MISO | MIMO | Table | Image | Text | Graph | Regression | Classification | Dense Prediction | |
Feature Selection | 1 | ✓ | ✓ | ✓ | ✗ | |||||||
Decomposition | 1 | ✓ | ✓ | ✓ | ✓ | |||||||
Regularization | Low-Rank Factorization | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
Priori Sharing | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
Task Clustering/Grouping | 1 | ✓ | ✓ | ✓ | ||||||||
Group-Based Learning | 1 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | |||||
Relationship Learning | Mixture-of-Experts | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
Feature Fusion | 2 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Cascading | 2 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Knowledge Distillation | 2 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Feature Propagation | Cross-Task Attention | 2 | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ||
Scalarization | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Multi-Objective Optimization | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Adversarial Training | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Optimization | Neural Architecture Search | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
Downstream Fine-tuning | 1 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Task Prompting | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Pre-training | Multi-Modal Unification | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
✓ indicates common practice in the research community. ✗ indicates not applicable due to technical constraints.
Taxonomy
MTL has seen significant advancement prior to the DL era (caruna1993multitask; caruana1997multitask; bakker2003task; ando2005framework; obozinski2006multi; zhang2006a). Initially, there was a strong focus on weight/parameter regularization, including sparse learning for cross-task feature selection, low-rank learning to uncover underlying factors, and decomposition methods to capture informative components. These approaches, while innovative in integrating intuitive variations of existing methods (e.g., the $\ell_{2,1}$ regularizer derived from the classic $\ell_1$ regularizer), still face limitations in practical applications due to idealistic assumptions and a lack of consideration for task relationships. The emergence of methods like task clustering, priori sharing, graph-based learning, and MoE marked a shift towards more effective task relationship modeling. With the transition to the DL era, the abundance of features learned from architectures like convolutional neural networks (CNNs) (fukushima1980neocognitron; lecun1998gradient), recurrent neural networks (RNNs) (werbos1988generalization; hochreiter1997long) and Transformers (vaswani2017attention; dosovitskiy2020image) spurred the exploration of feature propagation methods, such as feature fusion, cascading, knowledge distillation (KD), and cross-task attention, all crucial for leveraging multi-source features. Alternatively, optimization-based methods, including scalarization, MOO, adversarial training and NAS, focus on gradients to harmonize optimization directions across tasks. These methods, while not restricted by I/O configurations, are constrained by the pre-defined number of tasks and the use of heterogeneous architectures. Pre-training techniques, which leverage TL, mark a significant advancement towards unified and versatile multitasking, breaking limitations related to data modalities, dimensions, task numbers, model architectures, etc.
The only cost is the large computation resources to train a really large model that can accommodate multi-task distributions. The MTL models are accordingly organized into five categories: regularization, relationship learning, feature propagation, optimization, and pre-training. Each contains a series of topics arranged chronologically in § 2.1 (traditional ML era), § 2.2 (DL era), and § 2.3 (PFM era). All of these topics can be inferred from three self-evident assumptions (but have been extensively validated by empirical evidence) as below:
Assumption 1 (Parameter Relatedness).
Under the same hypothesis space, models learned to perform related tasks can exhibit similarities.
Assumption 2 (Feature Richness).
Given the same level of experience, expanding the number of tasks to be learned can enhance the richness of features.
Assumption 3 (Optimization Consistency).
Learning multiple related tasks jointly in a single model can ensure consistency in optimization directions for each task.
We acknowledge that the presented taxonomy is not exhaustive, and certain methods may be classified differently when viewed from a different perspective. For example, Task Tree (TAT) (han2015learning), a clustering MTL method, establishes task hierarchy by decomposing the parameter matrix into different component matrices for each tree layer; we discuss it within the context of clustering MTL (see § 2.1.6). We also acknowledge that some methods that may be of interest to readers may not be included in this survey due to similarities or oversight. We welcome paper recommendations and will update the survey on our project page accordingly (https://github.com/junfish/Awesome-Multitask-Learning). In Table 2, we summarize the assumptions, common practices, and technical constraints of these topics in terms of I/O configuration, data modality, and task type.
2.1. Traditional Era: Provable but Restrictive
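Notation | Description |
$x$, $X$ | Scalars, denoted by plain lowercase or uppercase letters. |
#object | The number of objects, e.g., #task denotes the number of tasks. |
$\mathbf{x}$ or $\mathbf{x} \in \mathbb{R}^{n}$ | A vector with $n$ entries, denoted by bold lowercase letters. |
$\mathbf{X} \in \mathbb{R}^{m \times n}$ | A matrix with size $m \times n$, denoted by bold uppercase letters. |
$\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ | A tensor with size $n_1 \times \cdots \times n_K$, denoted by bold calligraphic letters. |
$\{x_i\}_{i=1}^{n}$ | A set containing $n$ elements $x_i$, where $x_i$ could be anything, e.g., scalar, vector, data pair, learner, etc. |
$x_i$ | The $i$-th entry of vector $\mathbf{x}$. |
$x_{ij}$ or $\mathbf{X}_{ij}$ | The $(i,j)$-th entry of matrix $\mathbf{X}$. |
$\mathbf{A} \odot \mathbf{B}$ | Element-wise product of $\mathbf{A}$ and $\mathbf{B}$, which means the $(i,j)$-th entry of $\mathbf{A} \odot \mathbf{B}$ is $a_{ij} b_{ij}$. |
$\mathbf{x}_j$ | The $j$-th column vector of matrix $\mathbf{X}$. |
$\mathbf{x}^i$ | The $i$-th row vector of matrix $\mathbf{X}$. |
$\mathbf{I}_n$ | The identity matrix of size $n \times n$, which has ones on the diagonal and zeros elsewhere. |
tr$(\mathbf{X})$ | The trace of a matrix $\mathbf{X}$, defined as the sum of its components on the diagonal. |
col$(\mathbf{X})$ | The column space of a matrix $\mathbf{X}$, which consists of all linear combinations of its column vectors. |
rank$(\mathbf{X})$ | The rank of matrix $\mathbf{X}$, defined as the maximum number of linearly independent column (or row) vectors of $\mathbf{X}$. |
vec$(\mathbf{X})$ | The vectorization of the matrix $\mathbf{X}$ in the row-by-row stacking way. |
$\mathbf{X}^{\dagger}$ | The pseudoinverse of a matrix $\mathbf{X}$. |
$\mathbb{O}^{n}$ | The set of $n \times n$ orthogonal matrices. |
 | The column vectors of matrix $\mathbf{X}$ are orthogonal. |
$\mathbb{S}^{n}$ | The set of $n \times n$ real symmetric matrices. |
$\mathbb{S}^{n}_{+}$ | The subset of $\mathbb{S}^{n}$ that contains positive semidefinite matrices. |
$\|\mathbf{x}\|_1$ | The $\ell_1$ norm of a vector, calculated as the sum of the absolute vector values. |
$\|\mathbf{x}\|_2$ | The $\ell_2$ norm of a vector, calculated as the square root of the sum of the squared vector values. |
$\|\mathbf{x}\|_\infty$ | The $\ell_\infty$ norm of a vector, calculated as the maximum of the absolute vector values. |
$\|\mathbf{X}\|_0$ | The $\ell_0$ norm, i.e., cardinality of a matrix, defined as the number of nonzero components. |
$\|\mathbf{X}\|_1$ | The $\ell_1$ norm of a matrix, calculated as the maximum of the $\ell_1$ norm of the column vectors. |
$\|\mathbf{X}\|_2$ | The $\ell_2$ norm of a matrix, calculated as its maximum singular value. |
$\|\mathbf{X}\|_F$ | The Frobenius norm of a matrix, calculated as the square root of the sum of the squared matrix values. |
$\{\sigma_i(\mathbf{X})\}$ | The set of non-increasing ordered singular values of matrix $\mathbf{X}$. |
$\|\mathbf{X}\|_*$ | The trace norm of a matrix, defined as the sum of its singular values, i.e., $\sum_i \sigma_i(\mathbf{X})$. |
$\|\mathbf{X}\|_\infty$ | The $\ell_\infty$ norm of a matrix, calculated as the maximum of the $\ell_1$ norm of the row vectors. |
$\|\mathbf{X}\|_{p,q}$ | The $\ell_{p,q}$ norm of a matrix, defined as the $\ell_q$ norm of the vector whose components are the $\ell_p$ norms of $\mathbf{X}$'s row vectors. |
$\|\mathbf{X}\|_{1,1}$ | The $\ell_{1,1}$ norm of a matrix, defined as the sum of the absolute matrix components. |
$\|\mathbf{X}\|_{2,1}$ | The $\ell_{2,1}$ norm of a matrix, calculated as the $\ell_1$ norm of the vector whose components are the $\ell_2$ norms of the row vectors. |
$\|\mathbf{X}\|_{\infty,1}$ | The $\ell_{\infty,1}$ norm of a matrix, calculated as the sum of the $\ell_\infty$ norms of the row vectors. |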
To establish a unified formulation, we start the review of traditional methods by defining a common framework. The notations for subsequent discussions are summarized in Table 3. Building upon this, we initiate our discussion with multiple standard regression models, one per task, as a paradigm. The weights of these homogeneous models can be arranged into one weight matrix, catalyzing a series of MTL studies through matrix regularization techniques in the traditional era. We denote by $\mathcal{D} = \{(\mathbf{X}_t, \mathbf{y}_t)\}_{t=1}^{T}$ our dataset across $T$ tasks. For each task indexed by $t$, we are given $n_t$ samples with $d_t$ features, i.e., $\mathbf{X}_t \in \mathbb{R}^{n_t \times d_t}$, and the corresponding response values $\mathbf{y}_t \in \mathbb{R}^{n_t}$.
The single-task setting of these multiple linear regression problems is
$$\mathbf{y}_t = \mathbf{X}_t \mathbf{w}_t + \boldsymbol{\epsilon}_t, \quad t = 1, \dots, T, \tag{3}$$
where for any $t \in \{1, \dots, T\}$, $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_{n_t})$ is the error term independent of $\mathbf{X}_t$, and $\mathbf{w}_t \in \mathbb{R}^{d_t}$ is determined by the system state for the $t$-th task. Each model is separately learned from independent samples $(\mathbf{X}_t, \mathbf{y}_t)$.
A trivial simplification of the above linear regressions is that all tasks maintain the same feature size $d$, thus leading to a natural idea of stacking the weight vectors for these tasks: $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_T] \in \mathbb{R}^{d \times T}$, where the matrix-based regularizers come into play. To estimate $\mathbf{W}$ as $\hat{\mathbf{W}}$, the MTL method minimizes the objective function:
$$\hat{\mathbf{W}} = \arg\min_{\mathbf{W}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{w}_t, \mathbf{y}_t) + \lambda\, \Omega(\mathbf{W}), \tag{4}$$
where we consider the weights of multiple models, i.e., $\{\mathbf{w}_t\}_{t=1}^{T}$, as a union, and denote by $\mathbf{w}_t$ the $t$-th column of $\mathbf{W}$. Normally, an identical loss function $\mathcal{L}$, e.g., mean squared error (MSE), is applied to every task, which originates from the Gaussian noise assumption on $\boldsymbol{\epsilon}_t$. To capture task relatedness from Assumption 1 that multiple models are similar to each other, the regularizer $\Omega(\mathbf{W})$ is designed to take various forms in traditional MTL. The overview of regularization techniques used in the traditional ML era for MTL (discussed in the following subsections) is presented in Table 2.1.
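As a minimal sketch of the framework above (not code from any surveyed work), the snippet below stacks per-task weight vectors into a $d \times T$ matrix and evaluates a regularized multi-task objective. The sum-of-squares loss, the squared Frobenius penalty, the synthetic data, and the names `fit_tasks_ridge` and `mtl_objective` are all assumptions chosen for illustration.

```python
import numpy as np

def fit_tasks_ridge(Xs, ys, lam=1e-2):
    """Fit one ridge regressor per task; return weights stacked as columns (d x T)."""
    d = Xs[0].shape[1]
    cols = []
    for X, y in zip(Xs, ys):
        # Closed-form ridge solution for one task: (X^T X + lam I)^{-1} X^T y
        cols.append(np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y))
    return np.column_stack(cols)

def mtl_objective(W, Xs, ys, lam, penalty):
    """Sum of per-task squared-error losses plus lam * penalty(W)."""
    loss = sum(np.sum((X @ W[:, t] - y) ** 2)
               for t, (X, y) in enumerate(zip(Xs, ys)))
    return float(loss + lam * penalty(W))

# Synthetic data: T tasks sharing the same feature size d (illustrative only).
rng = np.random.default_rng(0)
d, T = 5, 3
W_true = rng.normal(size=(d, T))
Xs = [rng.normal(size=(50, d)) for _ in range(T)]
ys = [X @ W_true[:, t] + 0.01 * rng.normal(size=50) for t, X in enumerate(Xs)]

frob_sq = lambda W: np.sum(W ** 2)          # squared Frobenius norm
W_hat = fit_tasks_ridge(Xs, ys, lam=1e-2)
obj_hat = mtl_objective(W_hat, Xs, ys, 1e-2, frob_sq)
obj_true = mtl_objective(W_true, Xs, ys, 1e-2, frob_sq)
```

With this particular penalty the objective decouples across tasks, so the joint minimizer is simply $T$ independent ridge regressions. The regularizers reviewed in the following subsections replace it with penalties that couple the columns of the weight matrix.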
Model Name | Origin | Reference | Type | Matrix Regularizer | Vector Formalization |
Regularized MTL | KDD | evgeniou2004regularized | Group regularization | Frobenius norm | |
Learning Multiple Tasks with Kernel Methods | JMLR | evgeniou2005learning | Priori sharing | Adaptive penalty | s.t. |
Alternating structure optimization | JMLR | ando2005framework | Decomposition | Frobenius norm | , s.t. |
Multi-task feature selection | Tech. Rep.¹ | obozinski2006multi | Group-sparse learning | $\ell_{2,1}$ norm | |
Multi-task Lasso | Thesis² | zhang2006a | Group-sparse learning | $\ell_{\infty,1}$ norm | |
Multi-task feature learning | NeurIPS | argyriou2006multi | Group-sparse learning, feature learning | $\ell_{2,1}$ norm | , s.t. |
Convex multi-task feature learning | Mach. Learn. | argyriou2008convex | Feature learning | Adaptive penalty | s.t. tr, colcol |
Low rank MTL | ICML | ji2009accelerated | Low-rank learning | Trace norm | |
Convex ASO | ICML | chen2009convex | - | - | |
Dirty block-sparse model | NeurIPS | jalali2010dirty | Group-sparse learning, decomposition | norm + norm | , s.t. |
Sparse multi-task Lasso | NeurIPS | lee2010adaptive | Group-sparse learning | Weighted $\ell_{1,1}$ norm + weighted $\ell_{2,1}$ norm | |
Adaptive multi-task Lasso | NeurIPS | lee2010adaptive | Group-sparse learning | Weighted $\ell_{1,1}$ norm + weighted $\ell_{2,1}$ norm + adaptive penalty | , |
Large margin multi-task metric learning | NeurIPS | parameswaran2010large | Priori sharing | Frobenius norm | s.t. |
Hierarchical multitask structured output learning | NeurIPS | gornitz2011hierarchical | Priori sharing | Frobenius norm | , where is the parent node. |
Robust MTL | KDD | chen2011integrating | Decomposition, group-sparse learning, low-rank learning | Trace norm + norm | , s.t. |
Temporal group Lasso | KDD | zhou2011multi | Group-sparse learning | Frobenius norm + $\ell_{2,1}$ norm | |
Clustered MTL | NeurIPS | zhou2011clustered | Task clustering | Clustering penalty + norm | , where is the #task in the -th cluster . |
Sparse and low rank MTL | TKDD | chen2012learning | Decomposition, sparse learning, low-rank learning | norm + trace norm | , s.t. |
Convex fused sparse group Lasso | KDD | zhou2012modeling | Group-sparse learning | $\ell_{1,1}$ norm + $\ell_{2,1}$ norm | |
Adaptive multi-task elastic-net | SDM | chen2012adaptive | Group-sparse learning | $\ell_{2,1}$ norm + Frobenius norm | |
Multi-level Lasso | ICML | lozano2012multi | Decomposition, sparse learning | norm + adaptive penalty | , s.t. |
Robust multi-task feature learning | KDD | gong2012robust | Decomposition, group-sparse learning | norm + norm | , s.t. |
Multi-stage multi-task feature learning | NeurIPS | gong2012multi | Sparse learning | Capped norm (zhang2010analysis) | |
Convex formulation for MTL | IJCAI | zhang2012convex | Priori sharing | Clustering penalty | trtr s.t. , tr |
Multi-linear multi-task learning | ICML | romera2013multilinear | Low-rank learning | Overlapped tensor trace norm | where is the mode- unfolding of tensor . |
Regularization approach to learn MTL | TKDD | zhang2014regularization | Priori sharing | Clustering penalty + norm | trln s.t. |
Multi-linear multi-task learning | NeurIPS | wimalawarne2014multitask | Low-rank learning | Scaled latent tensor trace norm | where is a tensor. |
Task Tree model | KDD | han2015learning | Task clustering | norm | |
Reduced rank multi-stage MTL | AAAI | han2016multi | Low-rank learning | Capped trace norm (sun2013robust) | |
¹ This work is published as a Technical Report, Department of Statistics, UC Berkeley.
² This work is published in Jian Zhang’s Ph.D. Thesis, CMU Technical Report CMU-LTI-06-006, 2006.
2.1.1. Feature Selection
The high-dimensional scaling (negahban2008joint), where the number of model weights is much larger than the number of observations, i.e., $d \gg n$, arises in many real-world problems and makes it costly and arduous to seek effective predictor variables. Sparse learning with an $\ell_1$ regularizer aims to identify a structure characterized by a reduced number of non-zero elements. This parsimonious solution retains and selects the most effective and efficient subset of features tailored to the target task (tibshirani1996regression). In MTL, Assumption 1 underpins the development of all sparse learning models. Under the settings of sparse learning, this assumption posits that similar sparsity patterns in model parameters suggest relatedness between tasks. As a result, sparsity patterns implicitly represent task relatedness, underscoring a subset of common features derived from these limited samples. Further benefits of employing sparsity in MTL are assessed and discussed in lounici2009taking. In this section, our discussion of feature selection in MTL encompasses both the block-wise ($\ell_{p,1}$) and element-wise ($\ell_{1,1}$) approaches. Each approach maintains both shared and task-specific features, optimizing performance across all tasks. In the block-wise approach, tasks can differentiate themselves from others’ priorities by attributing distinct weights to the commonly selected features. Conversely, the element-wise approach allows tasks to highlight their distinct preferences on predictors by opting for specific features in addition to the shared ones.
Block-Wise Sparsity
Multi-Task Feature Selection (obozinski2006multi) is the first method to address the problem of joint feature selection across a group of related tasks. This method extends the $\ell_1$ regularization for STL to the $\ell_{2,1}$ regularization for MTL. The assumption behind the $\ell_{2,1}$ regularization scheme is that multiple related tasks have a similar preference for a few common features, which encourages solutions that share the same sparsity pattern. Therefore, $\ell_{2,1}$ imposes a sparse penalty on the $\ell_2$ norms of the $T$-dimensional weight vectors associated with each feature across tasks (i.e., the row vectors of the weight matrix $\mathbf{W}$). This is formulated as follows:
$$\min_{\mathbf{W}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{w}_t, \mathbf{y}_t) + \lambda \|\mathbf{W}\|_{2,1}, \quad \|\mathbf{W}\|_{2,1} = \sum_{i=1}^{d} \|\mathbf{w}^i\|_2, \tag{5}$$
which selects features globally by encouraging several feature-wise weight vectors across all tasks to be $\mathbf{0}$. The $\ell_2$ norm imposed on the feature-wise weight vectors (i.e., $\mathbf{w}^i$) before the $\ell_1$ norm here is a magnitude measurement, which could be substituted by any other $\ell_p$ ($p \geq 1$) norm (obozinski2006multi). This penalty term can be seen as a generalization of $\ell_1$ regularization, to which it reduces when the task number $T = 1$. To solve problem (5), obozinski2006multi offers a block-coordinate descent optimization method that updates the block of weights associated with each feature. liu2012multi proposes an accelerated algorithm by reformulating it as two equivalent smooth convex optimization problems.
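The row-wise effect of this block-sparse penalty can be illustrated numerically. The sketch below computes the penalty as the sum of the $\ell_2$ norms of the rows of a weight matrix (rows are features, columns are tasks) and applies row-wise group soft-thresholding, the standard proximal operator of that penalty. It is not the block-coordinate descent solver of obozinski2006multi; the function names and the toy matrix are hypothetical.

```python
import numpy as np

def l21_norm(W):
    """Sum of the l2 norms of the rows of W (rows = features, columns = tasks)."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def prox_l21(W, tau):
    """Row-wise group soft-thresholding: proximal operator of tau * ||W||_{2,1}."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Rows with norm <= tau are zeroed; the others are shrunk toward zero.
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(row_norms, 1e-12))
    return scale * W

W = np.array([[3.0, 4.0],   # strong feature, row norm 5
              [0.3, 0.4],   # weak feature, row norm 0.5
              [0.0, 0.0]])  # unused feature
W_shrunk = prox_l21(W, tau=1.0)
```

The weak feature is removed for every task at once, while the strong feature is kept for all tasks with its weights shrunk, which is the shared sparsity pattern described above.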
Multi-Task Lasso (zhang2006a) extends the efficient $\ell_1$ regularizers by imposing the $\ell_\infty$ norm on each feature-wise weight vector $\mathbf{w}^i$. Based on the assumption that the number of effective predictor features is much smaller than the total number of features, Multi-Task Lasso learns a sparser structure by
$$\min_{\mathbf{W}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{w}_t, \mathbf{y}_t) + \lambda \|\mathbf{W}\|_{\infty,1}, \quad \|\mathbf{W}\|_{\infty,1} = \sum_{i=1}^{d} \|\mathbf{w}^i\|_\infty. \tag{6}$$
The use of $\ell_\infty$ means that each feature is penalized through the maximum absolute value of its feature-wise weight vector across all tasks. This is appropriate if relevant features are not shared by every task, and this situation frequently happens as the number of tasks grows. zhang2006a proves that this problem can be solved by an efficient convex optimization technique. Furthermore, a full spectrum of $\ell_{p,1}$ regularization suitable for MTL is investigated and discussed. However, negahban2008joint prove that the use of $\ell_{\infty,1}$ can improve learning efficiency only if the overlap of feature entries across tasks is large enough, as compared to the situation where each task solves a Lasso problem separately.
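To make the family of block-wise penalties concrete, the following sketch evaluates the sum over features of the $\ell_p$ norm of each feature-wise (row) weight vector for $p \in \{1, 2, \infty\}$. The helper name `lp1_norm` and the example matrices are assumptions for illustration only.

```python
import numpy as np

def lp1_norm(W, p):
    """Sum over rows (features) of the l_p norm of each row; p may be np.inf."""
    return float(np.sum(np.linalg.norm(W, ord=p, axis=1)))

W = np.array([[3.0, -4.0],
              [0.0,  0.0],
              [1.0,  1.0]])

l11 = lp1_norm(W, 1)         # element-wise: 7 + 0 + 2
l21 = lp1_norm(W, 2)         # block-wise with l2 magnitude: 5 + 0 + sqrt(2)
linf1 = lp1_norm(W, np.inf)  # block-wise with max magnitude: 4 + 0 + 1

# With a single task (one column) every variant collapses to the plain l1 norm.
w_single = np.array([[1.0], [-2.0], [0.0]])
single = [lp1_norm(w_single, p) for p in (1, 2, np.inf)]
```

All three variants give the unused second feature zero cost, and they differ only in how the magnitude of an active feature is measured across tasks.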
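Temporal Group Lasso (zhou2011multi) is an MTL formulation for predicting disease progression, which treats the time points of disease progression as related tasks. They first point out the limitation of task independence in the analytical solution to the ridge regression problem $\min_{\mathbf{W}} \|\mathbf{X}\mathbf{W} - \mathbf{Y}\|_F^2 + \theta_1 \|\mathbf{W}\|_F^2$, where the design matrix $\mathbf{X}$ is identical across tasks and $\mathbf{Y}$ denotes the progression of disease across $T$ tasks (time points). To capture the temporal smoothness of adjacent time points, Temporal Group Lasso adds the temporal smoothness term $\|\mathbf{W}\mathbf{H}\|_F^2$ and the feature selector term $\|\mathbf{W}\|_{2,1}$ to form the formalization as
$$\min_{\mathbf{W}} \|\mathbf{S} \odot (\mathbf{X}\mathbf{W} - \mathbf{Y})\|_F^2 + \theta_1 \|\mathbf{W}\|_F^2 + \theta_2 \|\mathbf{W}\mathbf{H}\|_F^2 + \delta \|\mathbf{W}\|_{2,1}, \tag{7}$$
where $\mathbf{H} \in \mathbb{R}^{T \times (T-1)}$ takes differences between the weight vectors of adjacent time points, and $\mathbf{S}$ is the indication matrix for the incomplete data, i.e., $S_{ij} = 0$ for any $(i, j)$ if the target value of sample $i$ at the $j$-th time point is missing and $S_{ij} = 1$ otherwise. This problem can be solved by the accelerated gradient method (AGM) (nesterov2013gradient) using SLEP (liu2009slep). However, to avoid the shrinkage of relevant features that would result in sub-optimal performance, zhou2011multi proposed a standard two-stage procedure to relax the regularization.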
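The two ingredients specific to this formulation, temporal smoothness and the handling of missing targets, can be sketched as follows. The sketch assumes that smoothness is measured by squared differences between the weight vectors of adjacent time points and that missing targets are masked out of a squared loss; the difference matrix `H`, the mask `S`, and all function names are hypothetical constructions for illustration rather than the authors' implementation.

```python
import numpy as np

def difference_matrix(T):
    """T x (T-1) matrix whose t-th column computes w_t - w_{t+1} when right-multiplied."""
    H = np.zeros((T, T - 1))
    for t in range(T - 1):
        H[t, t] = 1.0
        H[t + 1, t] = -1.0
    return H

def temporal_smoothness(W):
    """Sum of squared differences between weight vectors of adjacent time points."""
    H = difference_matrix(W.shape[1])
    return float(np.sum((W @ H) ** 2))

def masked_squared_loss(X, W, Y, S):
    """Squared loss that ignores entries whose target is missing (S == 0)."""
    return float(np.sum((S * (X @ W - Y)) ** 2))

W_const = np.array([[1.0, 1.0, 1.0]])   # weights that do not change over time
W_drift = np.array([[0.0, 1.0, 3.0]])   # weights that drift: steps of 1 and 2

X = np.eye(2)
W = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.zeros((2, 2))
S = np.array([[1.0, 0.0],               # target of sample 0 at time 1 is missing
              [1.0, 1.0]])
loss = masked_squared_loss(X, W, Y, S)
```

Constant weights incur no smoothness penalty, whereas drifting weights are penalized by the squared size of each step, and the masked entry contributes nothing to the loss.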
Adaptive Multi-Task Elastic-Net (chen2012adaptive) aims to address the problem of collinearity existing in the multi-task feature selection method. Inspired by elastic-net (zou2005regularization), a natural thought is to add another quadratic penalty to the sparse multi-task constraint , which forms the corresponding multi-task elastic-net problem as
(8) |
where the traditional mixed norm learns the same amount of regularization across all features. As discussed below in the adaptive sparse multi-task lasso (lee2010adaptive), it is promising to learn different regularization weights for each feature. However, unlike the application of eQTL detection (lee2010adaptive) where features on single nucleotide polymorphisms (SNPs) make it easier to incorporate prior knowledge for each feature (see Eq. (10)), the priors scaling the importance of adaptive weights for each feature are always unavailable in many real-world problems. chen2012adaptive proposes a three-stage algorithm to estimate the adaptive weights via using a data-driven method: (1) estimate the initial regression weights with uniform weight for each feature; (2) construct adaptive scaling weights according to the weights estimated in the first step, where is a fixed constant; (3) compute the final estimated parameters via the multi-task elastic-net with the adaptive scaling weights, i.e., .
Element-Wise Sparsity
Sparse Multi-Task Lasso (lee2010adaptive) allows feature-specific penalty magnitudes by incorporating a set of priors with fixed scaling parameters. This method also generalizes the sparse group Lasso penalty (simon2013sparse) by using both the $\ell_{1,1}$ and $\ell_{2,1}$ norms to perform joint block-wise and element-wise feature selection. Specifically, sparse multi-task Lasso proposes
(9) |
where and are the scaling weights for the and regularizers, respectively. There exist two advantages of this method: (1) Unlike previous work by obozinski2006multi; zhang2006a, which considers norm that learns block-wise sparsity well but overlooks element-wise sparsity within each feature group, sparse multi-task Lasso balances the and regularizers via and to achieve both simultaneously. (2) Unlike obozinski2006multi; zhang2006a, which treats every feature-wise weight vectors () equally, i.e., , the two scaling vectors in lee2010adaptive can be automatically learned from data. Furthermore, maurer2013sparse uses the regularizer on data preprocessed by a linear mapping function and provides bounds on the generalization error for both MTL and TL settings.
Adaptive Sparse Multi-Task Lasso (lee2010adaptive) is induced as a super-problem from above. This method adaptively incorporates prior knowledge on SNPs (brookes1999essence) to learn two scaling vectors and , which are defined as the mixtures of features on the -th SNP
(10) |
where is the -th feature of the -th SNP. Here, the component of in Eq. (9) denotes the number of minor alleles at the -th SNP of the -th sample. lee2010adaptive uses a directed graphical model as an elegant Bayesian tool to find the maximum a posteriori (MAP) estimate of all the above learnable weights, shown in Fig. 6. Then the conditional probability of weight matrix given and is
(11) |
where the normalization factor is upper-bounded by the inference of high dimensional multivariate Laplace distribution (gomez1998multivariate). Accordingly, lee2010adaptive proposes an alternating minimization approach that iteratively optimizes one of and by fixing another until convergence.
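Convex Fused Sparse Group Lasso (cFSGL) (zhou2012modeling) considers a formulation that additionally allows element-wise feature selection compared to the temporal group Lasso (zhou2011multi). cFSGL encourages sparsity for joint feature selection across tasks and for specific feature selection within a task. The formulation can be written as
$$\min_{\mathbf{W}} \mathcal{L}(\mathbf{W}) + \lambda_1 \|\mathbf{W}\|_{1,1} + \lambda_2 \|\mathbf{R}\mathbf{W}^{\top}\|_{1,1} + \lambda_3 \|\mathbf{W}\|_{2,1}, \tag{12}$$
where $\mathcal{L}(\mathbf{W})$ is the loss over all tasks, $\|\mathbf{R}\mathbf{W}^{\top}\|_{1,1}$ is the fused Lasso penalty with $\mathbf{R}$ taking differences between the weights of adjacent time points, and the combination of $\|\mathbf{W}\|_{1,1}$ and $\|\mathbf{W}\|_{2,1}$ is also known as the sparse group Lasso penalty (simon2013sparse). This problem with three non-smooth regularization terms can be solved by AGM via computing the decoupled proximal operator.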
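The idea of a decoupled proximal operator can be illustrated on the sparse group Lasso part alone. For the sum of an element-wise $\ell_1$ penalty and a row-wise $\ell_2$ penalty, the proximal operator is known to decompose into element-wise soft-thresholding followed by row-wise group shrinkage. The sketch below shows only this two-term composition; the fused term is deliberately left out, and the function names and thresholds are assumptions.

```python
import numpy as np

def soft_threshold(W, tau):
    """Element-wise soft-thresholding: proximal operator of tau * sum |W_ij|."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def group_shrink_rows(W, tau):
    """Row-wise group shrinkage: proximal operator of tau * sum of row l2 norms."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12)) * W

def prox_sparse_group(W, tau_l1, tau_group):
    """Composition for the l1 + row-l2 penalty: element-wise step, then group step."""
    return group_shrink_rows(soft_threshold(W, tau_l1), tau_group)

W = np.array([[3.0, -0.5],   # one strong and one weak entry in the same feature
              [0.5,  0.2]])  # uniformly weak feature
P = prox_sparse_group(W, tau_l1=1.0, tau_group=1.0)
```

In the first row the weak entry is removed for one task only (element-wise sparsity), while the whole second row is removed for all tasks (block-wise sparsity).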
Multi-Stage Multi-Task Feature Learning (gong2012multi) represents a pioneering approach to address the sub-optimal solutions observed in prior convex sparse regularization problems. This sub-optimality can be attributed to the gap between convex penalties and the $\ell_0$-type regularization they approximate. In response to this limitation, the method introduces a non-convex formulation utilizing capped $\ell_1$ regularization for MTL:
$$\min_{\mathbf{W}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{w}_t, \mathbf{y}_t) + \lambda \sum_{i=1}^{d} \min\left(\|\mathbf{w}^i\|_1, \theta\right), \tag{13}$$
where $\theta$ is a threshold that caps the $\ell_1$ norm of the weight vector $\mathbf{w}^i$ corresponding to each feature, and the term $\sum_{i} \min(\|\mathbf{w}^i\|_1, \theta)$ is a natural generalization of the capped $\ell_1$ norm in zhang2010analysis; zhang2013multi. To solve the non-convex problem (13), gong2012multi proposed an efficient algorithm and investigated the estimation error bound of the resulting estimator.
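A capped row-wise penalty of this kind is easy to evaluate, which helps show why it avoids over-shrinking strong features. The sketch below assumes the penalty is the sum over features of the row-wise $\ell_1$ norm capped at a threshold; the function name `capped_row_l1` and the example values are hypothetical, and the multi-stage solver itself is not reproduced.

```python
import numpy as np

def capped_row_l1(W, theta):
    """Sum over rows of min(l1 norm of the row, theta)."""
    row_l1 = np.sum(np.abs(W), axis=1)
    return float(np.sum(np.minimum(row_l1, theta)))

W = np.array([[3.0, -4.0],   # strong feature, row l1 norm 7
              [0.1,  0.1],   # weak feature, row l1 norm 0.2
              [0.0,  0.0]])

plain = capped_row_l1(W, np.inf)    # no cap: ordinary sum of absolute values
capped = capped_row_l1(W, 1.0)      # strong row contributes only theta = 1

# Doubling the strong row leaves the capped penalty unchanged.
W_big = W.copy()
W_big[0] *= 2.0
capped_big = capped_row_l1(W_big, 1.0)
```

Once a row exceeds the threshold, its penalty is constant, so only the weak rows still feel a pull toward zero.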
2.1.2. Feature Transformation
Unlike the sparse learning methods discussed in § 2.1.1, which assume direct use of observed features, feature transformation methods aim to combine and transform, rather than simply select, the raw features into new representations. This approach enables handling coarse-grained input data. Sparse learning in MTL builds task relatedness into the model by sharing a similar weight structure across multiple tasks, whereas feature learning in MTL relates tasks to each other by enforcing a common underlying representation (argyriou2006multi). For example, yu2019towards points out that the two tasks of aesthetic quality assessment and emotional recognition in digital image analysis share similar feature representations. Another example from caruna1993multitask; caruana1997multitask, as shown in Fig. 6(a), reveals that different tasks can synchronously learn from the same feature encodings in feedforward neural networks (FNNs).
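Multi-Task Feature Learning (argyriou2006multi) linearly combines features by introducing a transformation matrix $\mathbf{U} \in \mathbb{R}^{d \times d}$, which can be extended to nonlinear combinations by using kernel methods. As formulated in the following,
$$\min_{\mathbf{A}, \mathbf{U}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{U} \mathbf{a}_t, \mathbf{y}_t) + \lambda \|\mathbf{A}\|_{2,1}^{2}, \quad \text{s.t. } \mathbf{U} \in \mathbb{O}^{d}, \tag{14}$$
we need to estimate $\mathbf{A}$ and $\mathbf{U}$ from the data, with $\mathbf{W} = \mathbf{U}\mathbf{A}$. The $\ell_{2,1}$ norm imposed on $\mathbf{A}$ ensures that the transformed features, i.e., $\mathbf{X}_t \mathbf{U}$, with a fixed $\mathbf{U}$, are collectively selected across tasks. To learn the transformed features, argyriou2006multi fix $\mathbf{A}$ and minimize the objective function (14) over $\mathbf{U}$ under the orthogonality constraint. Even with this two-step alternating optimization over $\mathbf{A}$ and $\mathbf{U}$, problem (14) remains non-convex. Accordingly, it is transformed into an equivalent convex problem (also known as convex multi-task feature learning (argyriou2006multi; argyriou2008convex), which is mentioned in argyriou2006multi and further discussed in argyriou2008convex with the learning of non-linear features using kernel methods) as follows.
$$\min_{\mathbf{W}, \mathbf{D}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{w}_t, \mathbf{y}_t) + \lambda \sum_{t=1}^{T} \mathbf{w}_t^{\top} \mathbf{D}^{\dagger} \mathbf{w}_t, \quad \text{s.t. } \mathbf{D} \in \mathbb{S}^{d}_{+},\ \text{tr}(\mathbf{D}) \leq 1,\ \text{col}(\mathbf{W}) \subseteq \text{col}(\mathbf{D}). \tag{15}$$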
dong2015multi first extends neural machine translation to an MTL framework that shares a bidirectional recurrent representation with forward and backward sequence information, as shown in Fig. 6(b). Suppose we have several language pairs that share one source language, for instance, from English to many other languages like French, Spanish, Dutch, and Portuguese. For each pair, the probability of generating the translated word $y_t$ at time step $t$ is
$$p(y_t \mid y_1, \dots, y_{t-1}, \mathbf{x}) = g(y_{t-1}, \mathbf{s}_t, \mathbf{c}_t), \tag{16}$$
where $g$ is parameterized by an FNN, $\mathbf{s}_t$ is the hidden state of a recurrent neural network at time step $t$, and $\mathbf{c}_t$ is a context vector calculated from a sequence of annotations $(\mathbf{h}_1, \dots, \mathbf{h}_L)$, which is mapped from the original sentence $\mathbf{x}$ by an encoder. For more details on bidirectional sequence learning, please refer to dong2015multi. After that, all annotations are collectively transformed by soft alignment parameters for each encoder-decoder to achieve cross-task communication.
2.1.3. Low-Rank Factorization
In MTL, as discussed before, information sharing among multiple tasks can be achieved by assuming that all the tasks are impacted by the same small subset of predictors. On the other hand, low-rank structures imposed on the coefficient matrices or tensors can induce a different type of information sharing among tasks, i.e., the tasks are affected by the predictors through a shared small set of latent variables or directions, which are extracted from the original feature space and are the most relevant subspace to the outcomes. Depending on the way of indexing multiple learning tasks, one can choose to organize the coefficient vectors from multiple learning tasks into a matrix of dimension or a tensor with a more delicate structure. In general, the multi-dimensional indices of tasks commonly imply that there are multi-layer relationships among multiple tasks, and the tensor form can help keep this inherent structure which allows leveraging information from different dimensions of task similarities.
Matrix Factorization
The most common situation is when we organize the coefficient vectors from multiple tasks into a matrix $\mathbf{W} \in \mathbb{R}^{d \times T}$, and the rank-penalized problem can be formulated as
$$\min_{\mathbf{W}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{w}_t, \mathbf{y}_t) + \lambda\, \text{rank}(\mathbf{W}). \tag{17}$$
However, minimizing the rank of a matrix is NP-hard (vandenberghe1996semidefinite) due to the combinatorial nature of the rank function (ji2009accelerated; han2016multi). An alternative is to substitute the rank penalty with the trace when the matrix is symmetric positive semidefinite (mesbahi1999semi), but this excludes the non-symmetric or even non-square matrices that arise in real-world applications. fazel2001rank generalized the trace heuristic to any matrix by introducing the trace norm (a.k.a. nuclear norm or Ky Fan $k$-norm) (horn2012matrix), which is defined as the sum of all singular values of a matrix (see Table 3).
Low Rank Multi-Task Learning (ji2009accelerated) first introduces the trace norm optimization problem into MTL, which yields a low-rank solution that maps to a low-dimensional feature subspace. The problem can be written as
$$\min_{\mathbf{W}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{w}_t, \mathbf{y}_t) + \lambda \|\mathbf{W}\|_{*}, \tag{18}$$
where $\|\mathbf{W}\|_{*}$ denotes the trace norm of the weight matrix $\mathbf{W}$. The technical challenge for the problem above is the non-smooth nature of the trace norm, which makes subgradient-based solvers converge slowly, at rate $O(1/\sqrt{k})$ ($k$ is the number of iterations). ji2009accelerated developed an accelerated gradient method that improves the convergence rate of trace norm minimization from $O(1/\sqrt{k})$ to $O(1/k)$, and even to $O(1/k^2)$ with the help of Nesterov’s method (nesterov1983method). A dual reformulation (pong2010trace) of problem (18) can make it easier to solve. In fact, both the rank penalty and the trace norm can be written in a more general form $\sum_{i} p(\sigma_i(\mathbf{W}))$, where $p(\cdot)$ is a penalty function and $\sigma_i(\mathbf{W})$ is the $i$-th largest singular value of $\mathbf{W}$. When $p(x) = \mathbb{1}(x \neq 0)$, where $\mathbb{1}(\cdot)$ is the indicator function, we get the rank penalty, which is also the $\ell_0$ norm of the singular values. When $p(x) = x$, we get the nuclear norm penalty, i.e., the $\ell_1$ norm of the singular values. For $p(x) = x^q$ with $0 < q < 1$, the properties of the $\ell_q$ norm of the singular values, i.e., the Schatten-$q$ quasi-norm penalty, have been investigated in rohde2011estimation.
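The family of spectral penalties discussed above can be sketched in a few lines. The snippet below evaluates a generic penalty on the singular values (rank, trace norm, and a Schatten-type power penalty) and applies singular value thresholding, the standard proximal operator of the trace norm used inside proximal or accelerated gradient solvers. It is an illustrative sketch rather than the algorithm of ji2009accelerated; the function names, the tolerance, and the toy matrix are assumptions.

```python
import numpy as np

def spectral_penalty(W, p):
    """Generic penalty sum_i p(sigma_i(W)) applied to the singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(p(s)))

def svt(W, tau):
    """Singular value thresholding: proximal operator of tau * trace norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

W = np.diag([3.0, 0.5])   # singular values 3 and 0.5

rank_pen = spectral_penalty(W, lambda s: (s > 1e-10).astype(float))  # rank
trace_pen = spectral_penalty(W, lambda s: s)                         # trace norm
schatten_pen = spectral_penalty(W, lambda s: s ** 0.5)               # q = 0.5

W_low = svt(W, tau=1.0)   # singular values become 2 and 0
```

Thresholding subtracts the same amount from every singular value, so the small one vanishes and the result has rank one, but the large one is shrunk as well.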
Instead of using different power functions of singular values as penalty functions, there are some other variants of the nuclear norm penalty that can lead to more delicate learning of a low-rank matrix.
The rank of a matrix is the number of its non-zero singular values, so a lower rank corresponds to fewer non-zero singular values. The trace norm penalizes all singular values, including the large ones, whereas it is more desirable and reasonable to shrink only the small singular values toward zero, which yields a more focused and effective regularization. To leave the larger singular values un-penalized, Reduced Rank Multi-Stage Multi-Task Learning (RAMUSA) (han2016multi) considers the objective function with the truncated trace norm (zhang2012matrix) as
$$\min_{\mathbf{W}} \sum_{t=1}^{T} \mathcal{L}(\mathbf{X}_t \mathbf{w}_t, \mathbf{y}_t) + \lambda \sum_{i} \min\left(\sigma_i(\mathbf{W}), \tau\right). \tag{19}$$
The parameter $\tau$ serves as a threshold on the singular value magnitude, and only those singular values smaller than $\tau$ are penalized. When $\tau = \infty$, problem (19) reduces to the low-rank MTL problem (18). To address this non-convex problem, han2016multi introduce a multi-stage algorithm designed to learn a surrogate upper-bound function. Theoretical proofs affirm its capability for shrinkage, making it an effective approach to tackle the non-convex optimization challenge.
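The effect of capping can be checked numerically. The sketch below assumes the penalty is the sum of the singular values, each capped at a threshold; the helper name `capped_trace_norm` and the test matrices are hypothetical, and the multi-stage solver is not reproduced.

```python
import numpy as np

def capped_trace_norm(W, tau):
    """Sum of singular values, each capped at tau."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(np.minimum(s, tau)))

W = np.diag([3.0, 0.5])
W_scaled = np.diag([30.0, 0.5])                # only the large singular value grows

capped = capped_trace_norm(W, 1.0)             # 1 + 0.5
capped_scaled = capped_trace_norm(W_scaled, 1.0)
uncapped = capped_trace_norm(W, np.inf)        # ordinary trace norm: 3 + 0.5
```

Growing the dominant singular value does not change the capped penalty, and with an infinite threshold the penalty coincides with the ordinary trace norm.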
An alternative to the truncated trace norm for relieving the shrinkage on large singular values is the adaptive nuclear norm penalization proposed by chen2013reduced, which penalizes a weighted sum of singular values. The weights $w_i$ adjust the level of penalization on each singular value; they should be non-negative and non-decreasing, i.e., $0 \leq w_1 \leq w_2 \leq \cdots$ for singular values ordered as $\sigma_1 \geq \sigma_2 \geq \cdots$. The explanation is straightforward: the larger weights on the smaller singular values ensure a greater shrinkage towards 0, while the smaller weights on the larger singular values reduce the shrinkage magnitude.
Low-rank methods are useful for achieving dimension reduction by learning a small set of latent variables. However, low-rank methods alone cannot identify which variables are truly predictive of the outcomes. To obtain a more interpretable model, one can assume that not all predictors affect the outcomes and add a sparsity-inducing penalty in addition to a low-rank restriction. In the field of statistics, this line of research has received much attention, and variable selection can be achieved by adding a row-wise penalization on the coefficient matrix in a rank-restricted model. For example, chen2012sparse apply a group-lasso type penalty on the rows of the coefficient matrix. Similar work includes bunea2012joint and she2017selective. Another form of sparsity structure considered in low-rank models is the sparse SVD discussed in chen2012reduced and uematsu2019sofar. Sparse SVD achieves predictor and response selection simultaneously. With a rank $r$, SVD dissects the correlation between responses and predictors, i.e., the coefficient matrix, into $r$ orthogonal channels. The importance of each channel is measured by a singular value, and within each channel, the weights on predictors (responses) are in the corresponding right (left) singular vectors. The sparse SVD can achieve both an SVD layer-specific sparsity pattern, by imposing sparsity on the elements of each singular vector to find the different subsets of predictors/responses that take effect in each correlation pathway (chen2012reduced), and global variable selection, by shrinking all weights related to a certain variable in the singular vectors to zero (uematsu2019sofar).
Tensor Factorization
When we have multiple learning tasks that can be indexed by multi-dimensional indices, instead of stacking all the weight vectors into a matrix of dimension #features $\times$ #tasks, keeping the structure of the task index by storing the weight vectors in a tensor leads to MultiLinear Multi-Task learning (MLMT) (wimalawarne2014multitask). MLMT offers several advantages over conventional MTL. First, it keeps the inherent structure of the learning tasks so that different dimensions of task similarities can be learned, and the higher-order structures among tasks can be recovered as well. Moreover, MLMT makes task imputation (i.e., TL) available for tasks with no training data (wimalawarne2014multitask). The learning problem can be written as
(20) |
where is a tensor consisting of learning weights , and the total number of tasks .
To exploit task similarities at each dimension, similar to low-rank matrix-based MTL, a multilinear rank restriction can be imposed on the weight tensor. In romera2013multilinear, the authors directly incorporated the rank restriction into the learning task by using a low-rank Tucker decomposition (kolda2009tensor) of the weight tensor, and the Frobenius norms of Tucker decomposition components are added as regularizations to reduce overfitting. This optimization problem is solved by alternating minimization.
Alternatively, tensor trace norms are commonly used as a convex approximation to rank restrictions. However, unlike the matrix rank, a tensor rank has no unique definition, so various trace norms have been developed to fulfill different analysis demands for different anticipated information sharing mechanisms among tasks (zhang2022learning). With $|||\cdot|||_{*}$ denoting a tensor trace norm, $\mathcal{W}$ the weight tensor, and $\mathcal{L}(\mathcal{W})$ the loss of problem (20), the learning task is
$$\min_{\mathcal{W}} \mathcal{L}(\mathcal{W}) + \lambda\, |||\mathcal{W}|||_{*}, \tag{21}$$
where $\lambda$ is the tuning parameter to control the magnitude of penalization.
In general, in the sense of Tucker decomposition or multi-linear SVD (tomioka2013convex; kolda2009tensor), tensor trace norms include two categories: the overlapped tensor trace norms and the latent tensor trace norms. The latent trace norm (tomioka2013convex; wimalawarne2014multitask) can be written as
$$|||\mathcal{W}|||_{\text{latent}} = \inf_{\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)} = \mathcal{W}} \sum_{k=1}^{K} \left\| \mathcal{W}^{(k)}_{(k)} \right\|_{*}, \tag{22}$$
where $\mathcal{W}^{(1)}, \dots, \mathcal{W}^{(K)}$ are latent tensors of $\mathcal{W}$ and $\mathcal{W}_{(k)}$ denotes a tensor flattened along its $k$-th axis. Thus, the latent trace norm is the infimum of the sum of the matrix trace norms of the flattened latent tensors of $\mathcal{W}$. To account for heterogeneous multilinear ranks and dimensions, wimalawarne2014multitask propose a scaled latent trace norm by adding a weight $1/\sqrt{n_k}$ to each component $\| \mathcal{W}^{(k)}_{(k)} \|_{*}$. It can identify the dimension with the lowest rank relative to its dimensionality, $r_k / n_k$. The overlapped tensor trace norm (romera2013multilinear) of a tensor is defined as the weighted sum of the nuclear norms of its flattened tensors. With different ways of tensor flattening, the overlapped tensor trace norms take different forms, including the Tucker trace norm (romera2013multilinear), which is a convex combination of the matrix trace norms of the tensor flattened along each axis, and the Tensor-Train trace norm (oseledets2011tensor), which flattens the tensor along successive axes starting from the first axis. Given that the feature representation can be factorized into semantic basis vectors and linear coefficients mapping the basis vector space to the original feature vector space, yang2016deep introduce the utilization of low-rank tensors in MTL through deep representation learning.
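Flattening and the overlapped trace norm are straightforward to compute, which the following sketch illustrates. It flattens a tensor along each axis and takes a weighted sum of the nuclear norms of the resulting matrices, with uniform weights assumed by default. The function names and the rank-one test tensor are hypothetical, and the latent trace norm is not computed here because it requires solving the infimum over latent tensors.

```python
import numpy as np

def unfold(tensor, mode):
    """Flatten a tensor along one axis: rows index that axis, columns the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def nuclear_norm(M):
    """Sum of the singular values of a matrix."""
    return float(np.sum(np.linalg.svd(M, compute_uv=False)))

def overlapped_trace_norm(tensor, weights=None):
    """Weighted sum of nuclear norms of the flattenings along every axis."""
    K = tensor.ndim
    if weights is None:
        weights = np.ones(K) / K   # assumed uniform convex combination
    return float(sum(w * nuclear_norm(unfold(tensor, k))
                     for k, w in enumerate(weights)))

# Rank-one tensor a (x) b (x) c: every flattening has rank one and nuclear
# norm ||a|| * ||b|| * ||c|| = 5 * 1 * 2.
a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([2.0, 0.0])
tensor = np.einsum('i,j,k->ijk', a, b, c)

per_mode = [nuclear_norm(unfold(tensor, k)) for k in range(tensor.ndim)]
overlapped = overlapped_trace_norm(tensor)
```

The column ordering produced by this `unfold` may differ from other flattening conventions, but the nuclear norm is unaffected by column permutations, so the penalty value is the same.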
Most overlapped tensor trace norms use only a subset of all possible flattenings of a tensor, with each choice reflecting a different belief about the information-sharing mechanism among tasks. To discover all the low-rank structures in a weight tensor and to unify the various overlapped tensor trace norms, zhang2022learning propose a Generalized Tensor Trace Norm (GTTN), which is a convex combination of the matrix trace norms of all possible tensor flattenings. The combination weights of these matrix trace norms are treated as unknown variables in the optimization problem, so that each flattening can receive a different level of importance.
When nonlinear low-rank structures among tasks are expected to achieve better learning performance, zhang2022learning propose the nonlinear GTTN that firstly transforms the rows or columns of each flattened tensor nonlinearly via a neural network and then performs GTTN on the transformed parameters to capture the nonlinear low-rank structure among all the tasks. For models that are nonlinear in the data, signoretto2013learning also provide a kernel-based method for MLMT.
2.1.4. Decomposition
Task-relatedness can be learned based on the assumption that similar tasks share the same non-zero elements, and these tasks can acquire richer representations through transformation or low-rank regularization. The decomposition methods discussed in this section aim to capture multiple aspects of task-relatedness, such as sparsity and low-rankness, by decomposing model weights into a sum or product of distinct components. These components not only capture shared information but also task-specific information that benefits each task. The flexibility of decomposition techniques provides deeper insights into the nature of multitasking, enabling exploration of various combinations of regularizers suitable for different types of multitasking, including the incorporation of irrelevant or outlier tasks. However, decomposition methods have a limitation. The regularization applied to complex components may lead to non-smooth optimization problems involving a large number of variables, which can pose challenges in efficiently solving the devised decomposition problem. In the MTL setting, the general formalization of decomposition problems can be expressed as
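$$\min_{\mathbf{P}, \mathbf{Q}} \; \mathcal{L}(\mathbf{W}) + \lambda_1 \Omega_1(\mathbf{P}) + \lambda_2 \Omega_2(\mathbf{Q}), \tag{23}$$
where the weight matrix $\mathbf{W}$ is composed of the components $\mathbf{P}$ and $\mathbf{Q}$ (as a sum or a product), $\mathcal{L}$ is the empirical loss over all tasks, and $\Omega_1$ and $\Omega_2$ are regularizers, weighted by $\lambda_1$ and $\lambda_2$, that encode different types of task-relatedness.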
Form “$\mathbf{W} = \mathbf{P} + \mathbf{Q}$”
The Dirty Block-Sparse Model (jalali2010dirty) starts from the observation that the benefit of block-sparsity regularizers ($\ell_1/\ell_q$) depends on the degree of feature overlap among tasks. Acknowledging the prevalence of dirty high-dimensional data in many multi-task scenarios (that is, data that are not only high-dimensional, containing a large number of features or attributes, but also contain errors, inaccuracies, or misleading information), this model explicitly decomposes the weight matrix into an element-wise sparse component and a block-sparse component, $\mathbf{W} = \mathbf{P} + \mathbf{Q}$:
$$\min_{\mathbf{P}, \mathbf{Q}} \; \sum_{t=1}^{T} \mathcal{L}_t(\mathbf{p}_t + \mathbf{q}_t) + \lambda_1 \|\mathbf{P}\|_{1,1} + \lambda_2 \|\mathbf{Q}\|_{1,\infty}, \tag{24}$$
where $\mathbf{p}_t$ and $\mathbf{q}_t$ are the $t$-th columns of $\mathbf{P}$ and $\mathbf{Q}$, respectively, and $\mathcal{L}_t$ is the loss of the $t$-th task. The $\ell_{1,1}$ norm learns an uneven sparse structure (obozinski2006multi; zhang2006a), while the $\ell_{1,\infty}$ norm ensures that features admitting block-wise sparsity are learned collectively across tasks (zhang2006a). jalali2010dirty prove that Eq. (24) matches Lasso ($\ell_1$) in the no-sharing STL regime and $\ell_1/\ell_\infty$ in the fully-sharing MTL regime, and strictly outperforms both methods elsewhere, including the dirty setting.
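To make the two penalties of the dirty model concrete, the sketch below evaluates an element-wise $\ell_{1,1}$ term and a block-wise $\ell_{1,\infty}$ term on a toy decomposition. The symbols `P` and `Q`, the feature-by-task layout (rows are features, columns are tasks), and the function name `dirty_penalty` are assumptions chosen for illustration.

```python
import numpy as np

def dirty_penalty(P, Q, lam1, lam2):
    """Penalty of a dirty-model style decomposition W = P + Q.

    P: element-wise sparse part, penalized by the sum of absolute entries.
    Q: block-sparse part, penalized per feature (row) by its largest
       absolute entry across tasks, so whole rows are switched on or off.
    """
    l11 = np.abs(P).sum()
    l1_inf = np.abs(Q).max(axis=1).sum()
    return lam1 * l11 + lam2 * l1_inf

# Three features, two tasks (hypothetical numbers).
P = np.array([[0.0, 1.0],
              [0.0, 0.0],
              [-2.0, 0.0]])   # task-specific, scattered non-zeros
Q = np.array([[3.0, -3.0],
              [0.0, 0.0],
              [0.0, 0.0]])    # one feature shared by both tasks
W = P + Q
penalty = dirty_penalty(P, Q, lam1=1.0, lam2=2.0)
```

Here the shared feature costs the same under the $\ell_{1,\infty}$ term whether one or both tasks use it, which is what encourages tasks to reuse the same rows of `Q`.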
Robust Multi-Task Feature Learning (rMTFL) (gong2012robust) captures the features shared among relevant tasks and identifies outlier tasks simultaneously. Specifically, the weight matrix for all tasks is first decomposed into two components, $\mathbf{W} = \mathbf{P} + \mathbf{Q}$. gong2012robust then impose the well-known group Lasso ($\ell_{2,1}$) penalty on the rows of the first component and the same penalty on the columns of the second component. Formally, rMTFL can be formulated as
$$\min_{\mathbf{P}, \mathbf{Q}} \; \mathcal{L}(\mathbf{P} + \mathbf{Q}) + \lambda_1 \|\mathbf{P}\|_{2,1} + \lambda_2 \|\mathbf{Q}^{\top}\|_{2,1}, \tag{25}$$
where $\mathcal{L}$ is the empirical loss over all tasks. The penalty applied to the rows of $\mathbf{P}$ captures shared information, as it selects the same non-zero features across all tasks. Simultaneously, the penalty on the columns of $\mathbf{Q}$ drives the columns of regular tasks to zero, so that the remaining non-zero columns identify the outlier tasks. In gong2012robust, a theoretical bound quantifies how well the rMTFL estimate approximates the true evaluation. Additionally, error bounds between the estimated weights of rMTFL and the underlying true weights are provided. Note that this method is specifically applicable to MTL settings in which some of the tasks are considered outliers.
Robust Multi-Task Learning (RMTL) (chen2011integrating) addresses real-world applications in which certain tasks are unrelated to the main groups of tasks, which degrades the learning performance of the other tasks. RMTL is designed to capture task relatedness by learning a low-rank structure while identifying outlier tasks. This approach draws inspiration from previous research on group sparsity (obozinski2006multi; lee2010adaptive). With the decomposition $\mathbf{W} = \mathbf{P} + \mathbf{Q}$, it is formulated as the non-smooth convex optimization problem
$$\min_{\mathbf{P}, \mathbf{Q}} \; \mathcal{L}(\mathbf{P} + \mathbf{Q}) + \lambda_1 \|\mathbf{P}\|_{*} + \lambda_2 \|\mathbf{Q}^{\top}\|_{2,1}, \tag{26}$$
where $\|\cdot\|_{*}$ is the matrix trace norm. Unlike in feature selection techniques, the $\ell_{2,1}$ norm here is imposed on the columns of the weight matrix. This penalty learns group sparsity over tasks rather than over features: it drives the columns of $\mathbf{Q}$ that belong to regular tasks to zero, so that outlier tasks are absorbed by the few non-zero columns and their negative influence on the shared component is diminished. The low-rank component $\mathbf{P}$ captures the relatedness among the remaining tasks. This differs from hsu2010robust, which focuses on learning both low-rank and sparse structures and provides a theoretically established and unique decomposition. RMTL, on the other hand, simultaneously learns the low-rank and task-wise sparse structures through an accelerated proximal method (APM) (nemirovski1994efficient; nesterov1998introductory). A performance bound for this integrated approach is also proven.
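Proximal methods for objectives of this kind rely on two closed-form shrinkage steps: singular value thresholding for the trace norm, and column-wise group soft-thresholding for the task-wise group-sparse penalty. The sketch below shows both operators in NumPy. It illustrates the standard operators only, not the full accelerated solver of the cited work, and the function names are assumptions.

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * (trace norm): shrink singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def column_group_shrink(M, tau):
    """Proximal operator of tau * (sum of column l2 norms).

    Columns (tasks) whose norm is below tau are set exactly to zero;
    the others are shrunk toward zero but keep their direction.
    """
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale

low_rank_part = singular_value_threshold(np.diag([3.0, 1.0]), tau=2.0)
task_sparse_part = column_group_shrink(
    np.array([[3.0, 0.1],
              [4.0, 0.1]]),
    tau=1.0,
)
```

In the second example the first column (norm 5) survives with a shrunk norm of 4, while the second column (norm about 0.14) is zeroed out entirely, which is the mechanism that yields task-wise sparsity.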
Sparse and Low-Rank Multi-Task Learning (chen2012learning) also decomposes the weight matrix into a low-rank component $\mathbf{P}$ and a sparse component $\mathbf{Q}$. Unlike chen2011integrating, which penalizes both structures jointly in the objective function, chen2012learning use a trace norm constraint to implicitly encourage the low-rank structure. The formulation is
$$\min_{\mathbf{P}, \mathbf{Q}} \; \mathcal{L}(\mathbf{P} + \mathbf{Q}) + \lambda \|\mathbf{Q}\|_{1} \quad \text{s.t.} \quad \|\mathbf{P}\|_{*} \le \tau, \tag{27}$$
where $\|\cdot\|_{1}$ is the element-wise $\ell_1$ norm, $\|\cdot\|_{*}$ is the trace norm, and $\tau$ bounds the trace norm of the low-rank component. This formulation is proved to be the tightest convex surrogate of the non-convex, NP-hard problem with a cardinality regularization term ($\ell_0$ norm) and a low-rank constraint. A general projected gradient scheme (boyd2004convex) is applied to solve the relaxed convex problem (27), and it can be accelerated using Nesterov’s method (nesterov1998introductory).
Form “$\mathbf{W} = \mathbf{P}\mathbf{Q}$”
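Alternating Structure Optimization (ASO) (ando2005framework) aims to facilitate structural learning from multiple tasks. By introducing an auxiliary variable $\mathbf{u}_t$ for each task such that $\mathbf{u}_t = \mathbf{w}_t + \boldsymbol{\Theta}^{\top}\mathbf{v}_t$, the problem is formulated as
$$\min_{\{\mathbf{u}_t, \mathbf{v}_t\}, \boldsymbol{\Theta}} \; \sum_{t=1}^{T} \Big( \mathcal{L}_t(\mathbf{u}_t) + \lambda_t \big\|\mathbf{u}_t - \boldsymbol{\Theta}^{\top}\mathbf{v}_t\big\|_2^2 \Big) \quad \text{s.t.} \quad \boldsymbol{\Theta}\boldsymbol{\Theta}^{\top} = \mathbf{I}, \tag{28}$$
where $\boldsymbol{\Theta}$ parameterizes the low-dimensional structure shared by all tasks and $\mathbf{v}_t$ holds the task-specific coefficients within it. The solution process for problem (28) alternates between two steps: fixing $(\boldsymbol{\Theta}, \{\mathbf{v}_t\})$ and optimizing $\{\mathbf{u}_t\}$, and then fixing $\{\mathbf{u}_t\}$ and optimizing $(\boldsymbol{\Theta}, \{\mathbf{v}_t\})$. The first step is a convex problem that is easily addressed by classic optimization methods such as stochastic gradient descent (SGD). The second step can be tackled using singular value decomposition (SVD) along with a series of linear algebra transformations. However, because the ASO problem is non-convex, the algorithm is not guaranteed to converge to a global optimum and may get stuck in local optima.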
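The SVD step of this alternating scheme can be sketched as follows: with the task predictors held fixed, the shared structure is taken from the leading left singular vectors of the stacked predictor matrix. The names `U`, `Theta`, `V`, `h`, and `aso_structure_step` are assumptions for this example, and the sketch further assumes equal regularization weights for all tasks, so it illustrates the idea rather than reproducing the original algorithm.

```python
import numpy as np

def aso_structure_step(U, h):
    """Given fixed predictors U (d x T, one column per task), return the
    h-dimensional shared structure and each task's coordinates in it."""
    left, _, _ = np.linalg.svd(U, full_matrices=False)
    Theta = left[:, :h].T        # h x d, rows are orthonormal
    V = Theta @ U                # h x T, task coefficients in the subspace
    W = U - Theta.T @ V          # d x T, part not explained by the subspace
    return Theta, V, W

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))      # six features, four tasks (toy data)
Theta, V, W = aso_structure_step(U, h=2)
```

The residual `W` is orthogonal to the shared subspace by construction, so the penalty on it measures exactly what the shared structure fails to capture.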
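Convex ASO (cASO) (chen2009convex) uses a convex relaxation to improve the convergence properties of the algorithm, so that it can converge to a global optimum. First, an improved ASO (iASO) formulation is proposed as an initial non-convex problem
$$\min_{\{\mathbf{u}_t, \mathbf{v}_t\}, \boldsymbol{\Theta}} \; \sum_{t=1}^{T} \Big( \mathcal{L}_t(\mathbf{u}_t) + \alpha \big\|\mathbf{u}_t - \boldsymbol{\Theta}^{\top}\mathbf{v}_t\big\|_2^2 + \beta \|\mathbf{u}_t\|_2^2 \Big) \quad \text{s.t.} \quad \boldsymbol{\Theta}\boldsymbol{\Theta}^{\top} = \mathbf{I}, \tag{29}$$
where the intercept is omitted in the SVM learner for simplicity. In Eq. (29), the two regularization terms manage task relatedness and model complexity, respectively. The traditional ASO formulation in Eq. (28) is a special case of iASO (with $\beta = 0$), irrespective of the choice of loss function.
To address the non-convex iASO problem (29), observe that $\mathbf{v}_t = \boldsymbol{\Theta}\mathbf{u}_t$ minimizes the regularization terms, so that these terms can be rewritten as
$$G(\mathbf{U}, \mathbf{M}) = \alpha\,\eta\,(1 + \eta)\, \mathrm{tr}\Big( \mathbf{U}^{\top} (\eta \mathbf{I} + \mathbf{M})^{-1} \mathbf{U} \Big), \tag{30}$$
where $\eta = \beta / \alpha$, $\mathbf{M} = \boldsymbol{\Theta}^{\top}\boldsymbol{\Theta}$, and $\mathbf{U} = [\mathbf{u}_1, \dots, \mathbf{u}_T]$. Relaxing the feasible set of $\mathbf{M}$ to its convex hull, the convex ASO formulation can be written as
$$\min_{\mathbf{U}, \mathbf{M}} \; \sum_{t=1}^{T} \mathcal{L}_t(\mathbf{u}_t) + G(\mathbf{U}, \mathbf{M}) \quad \text{s.t.} \quad \mathrm{tr}(\mathbf{M}) = h, \;\; \mathbf{M} \preceq \mathbf{I}, \;\; \mathbf{M} \succeq 0, \tag{31}$$
where $h$ is the dimension of the shared structure. The convex optimization procedure alternates between estimating $\mathbf{U}$ with $\mathbf{M}$ fixed and estimating $\mathbf{M}$ with $\mathbf{U}$ fixed. A convergence analysis proves that cASO (31) converges to a global optimum (chen2009convex).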
Multi-level Lasso, introduced by lozano2012multi, decomposes the regression coefficients into two components: one shared across all tasks and another that captures task-specific features. Specifically, lozano2012multi suppose that the “global” sparsity is controlled by a set of “main effect” variables. Thus, an alternative decomposition is proposed to obtain the desired property by rewriting each coefficient $w_{jt}$ as
$$w_{jt} = \theta_j \, \gamma_{jt}, \tag{32}$$
where $\theta_j \ge 0$ indicates the “effect” of the $j$-th feature and $\gamma_{jt}$ reflects task specificity. Accordingly, the optimization problem can be written as
$$\min_{\boldsymbol{\theta}, \boldsymbol{\gamma}} \; \sum_{t=1}^{T} \mathcal{L}_t(\mathbf{w}_t) + \lambda_1 \sum_{j} \theta_j + \lambda_2 \sum_{j, t} |\gamma_{jt}| \quad \text{s.t.} \quad \theta_j \ge 0. \tag{33}$$
This model accommodates variations in support across multiple tasks while preserving common structures. The optimization iteratively solves for either $\boldsymbol{\theta}$ or $\boldsymbol{\gamma}$ while keeping the other fixed, and this procedure is proven to converge in lozano2012multi. The limitation lies in this alternating procedure. When one component is learned with the other fixed, each subproblem essentially becomes a classical Lasso problem, which is relatively easy to solve. However, obtaining the solution of the global problem can be time-consuming, as pointed out in friedman2007pathwise.
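The effect of the product decomposition can be seen on a toy example: a zero feature-level factor removes a feature from every task at once, while the task-level factors let the remaining supports differ across tasks. The variable names and numbers below are hypothetical, and the penalty follows the two-term form sketched above.

```python
import numpy as np

theta = np.array([1.5, 0.0, 2.0])      # one non-negative effect per feature
gamma = np.array([[1.0, 0.0],
                  [3.0, 4.0],
                  [0.0, -1.0]])         # features x tasks, task specificity

# Product decomposition: w_jt = theta_j * gamma_jt
W = theta[:, None] * gamma

def multilevel_penalty(theta, gamma, lam1, lam2):
    """l1 penalties on the feature-level and the task-level factors."""
    return lam1 * theta.sum() + lam2 * np.abs(gamma).sum()

penalty = multilevel_penalty(theta, gamma, lam1=1.0, lam2=0.5)
support = W != 0                        # which features each task uses
```

Feature 1 is dropped for both tasks because its effect is zero, even though its task-level entries are large, whereas features 0 and 2 are each used by only one task.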
2.1.5. Priori Sharing
Multi-task priori sharing focuses on understanding and exploiting the relationships between different tasks to improve learning efficiency and performance. This approach is predicated on the idea that tasks, especially those that are related, can provide complementary information that enhances learning when approached collectively rather than in isolation. By identifying and leveraging the priori interconnections among tasks, priori sharing aims to achieve better generalization, more robust models, and improved predictions for each task.
The typical formulation of priori sharing in MTL takes the same form as Eq. (4). The objective minimizes a cumulative loss over all tasks, i.e., the sum of the individual losses of each task’s predictions against its true values, plus a global regularization term. The regularization term is applied to the combined weights, which concatenate all task-specific weights, thereby incorporating shared information across tasks into the model. It is designed from a priori knowledge of task interrelations and enforces structural constraints on the combined weights that reflect the assumed relationships between tasks. This formulation allows similarities and differences across tasks to inform the learning process, with the aim of improving generalization by leveraging shared patterns and task-specific peculiarities. Multi-task priori sharing can be broadly categorized as follows:
Task similarity. There is compelling evidence for the advantages of learning from multiple task domains rather than from single-task data. In earlier studies, such as evgeniou2004regularized and parameswaran2010large, the proposed formulations were built on prior assumptions of task relatedness. Specifically, evgeniou2004regularized and parameswaran2010large assumed that the learning tasks are similar to each other and employed task-coupling parameters to model the target average task. In Regularized MTL (evgeniou2004regularized), task-coupling parameters model the relationships between tasks and extend existing kernel-based single-task methods, such as the support vector machine (SVM), through a novel kernel function. Their formulation is
$$\min_{\mathbf{w}_0, \{\mathbf{v}_t\}, \{\xi_{it}\}} \; \sum_{t=1}^{T} \sum_{i=1}^{m} \xi_{it} + \frac{\lambda_1}{T} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 + \lambda_2 \|\mathbf{w}_0\|^2, \tag{34}$$
where $m$ is the number of data points for each task and $\xi_{it} \ge 0$ is the error (slack) of the $t$-th task’s estimate on its $i$-th data point, subject to the usual SVM margin constraints. They followed the formulation from Hierarchical Bayes (allenby1998marketing; arora1998hierarchical; heskes2000empirical) and described the $T$ target functions as hyperplanes $f_t(\mathbf{x}) = \mathbf{w}_t \cdot \mathbf{x}$, where $\mathbf{w}_t = \mathbf{w}_0 + \mathbf{v}_t$ denotes the corresponding target model. The authors assume that when the tasks are similar to each other, the discrepancies $\mathbf{v}_t$ between tasks are small and all tasks are linked to a common model $\mathbf{w}_0$. Additionally, evgeniou2005learning and kato2007multi use prior information on the similarities between pairs of tasks and add regularization terms that align the distance between model parameters with the distance between tasks. Furthermore, gornitz2011hierarchical describe the relationship between tasks using a tree structure, in which the model parameters of each node are regularized toward those of its parent node.
Task correlation. Nevertheless, simply assuming a relationship among tasks without supporting evidence can be detrimental and may bias the results. By learning task relatedness directly from the data, Bayesian models such as bonilla2007multi define a prior over the unobserved functions of all tasks and adapt the model parameters to the task identities and the observed data without strong modeling assumptions. In particular, they use multi-task Gaussian Process (GP) prediction to model the correlation among tasks. The formulation is
$$\big\langle f_l(\mathbf{x}) \, f_k(\mathbf{x}') \big\rangle = K^{f}_{lk} \, k^{x}(\mathbf{x}, \mathbf{x}'), \qquad y_{il} \sim \mathcal{N}\big( f_l(\mathbf{x}_i), \sigma_l^2 \big), \tag{35}$$
where a GP prior is placed over the latent functions $\{f_l\}$ to directly induce correlations between tasks, $K^{f}$ is a positive semi-definite (PSD) matrix that encodes the inter-task dependency, $k^{x}$ is the covariance function between input data points, $\sigma_l^2$ is the noise variance of the $l$-th task, and $\mathbf{f}_l = (f_l(\mathbf{x}_1), \dots, f_l(\mathbf{x}_n))^{\top}$ is the vector of function values corresponding to the observations $\mathbf{y}_l$. bonilla2007multi thus employ a common covariance function over input features together with a “free-form” covariance matrix over tasks, which offers significant flexibility in modeling diverse data forms and task relationships. Furthermore, this “free-form” covariance matrix reduces the need for extensive observed data, enhancing the efficiency of the method. To address the overfitting concern that stems from the point estimation in bonilla2007multi, zhang2010multi extend the multi-task GP to a weight-space view and place an inverse-Wishart prior on the task covariance matrix. This adaptation helps mitigate overfitting and enhances the robustness of the method.
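A compact way to see how a task covariance and an input covariance combine is to build the joint covariance of all observations as a Kronecker product plus task-wise noise. The sketch below assumes that every task is observed at the same inputs, an RBF input kernel, and a task-major stacking of the outputs. The names `K_task`, `K_input`, and `noise_var` are illustrative and not taken from the cited work.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential covariance between all pairs of rows of X."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def multitask_gp_covariance(K_task, K_input, noise_var):
    """Joint covariance of the stacked outputs (task-major ordering):
    Kronecker product of task and input covariances plus per-task noise."""
    n = K_input.shape[0]
    return np.kron(K_task, K_input) + np.kron(np.diag(noise_var), np.eye(n))

X = np.linspace(0.0, 1.0, 5)[:, None]          # five shared inputs
A = np.array([[1.0, 0.0],
              [0.8, 0.6]])
K_task = A @ A.T                               # PSD by construction
K_input = rbf_kernel(X, lengthscale=0.5)
Sigma = multitask_gp_covariance(K_task, K_input, np.array([0.1, 0.2]))
```

The off-diagonal block of `Sigma` equals the input covariance scaled by the task correlation (0.8 here), which is how observations of one task inform predictions for the other.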
Task covariance. In addition to learning through task correlation and task similarity, zhang2012convex; zhang2014regularization introduce Multi-Task Relationship Learning (MTRL), which uses a task covariance matrix to capture task relatedness. Within the regularization framework, they derive a convex formulation for multi-task learning that enables simultaneous learning of the model parameters and the task relationships. Their innovation lies in placing a matrix-variate normal prior on the weight matrix $\mathbf{W}$. Combined with suitable likelihood functions, this structured prior leads to an objective function whose minimizer is the maximum a posteriori solution. The objective function they employ is
$$\min_{\mathbf{W}, \boldsymbol{\Omega}} \; \mathcal{L}(\mathbf{W}) + \frac{\lambda_1}{2} \mathrm{tr}\big(\mathbf{W}\mathbf{W}^{\top}\big) + \frac{\lambda_2}{2} \mathrm{tr}\big(\mathbf{W}\boldsymbol{\Omega}^{-1}\mathbf{W}^{\top}\big) \quad \text{s.t.} \quad \boldsymbol{\Omega} \succeq 0, \;\; \mathrm{tr}(\boldsymbol{\Omega}) = 1, \tag{36}$$
that is, the minimization of a loss function augmented by a regularization term scaled by $\lambda_1$ that penalizes the Frobenius norm of $\mathbf{W}$, and an additional term scaled by $\lambda_2$ involving the trace of $\mathbf{W}\boldsymbol{\Omega}^{-1}\mathbf{W}^{\top}$, which reflects the matrix-variate normal prior. Here, $\boldsymbol{\Omega}$ denotes a positive definite matrix capturing the task covariance, and its complexity is controlled through constraints ensuring its positive definiteness and bounded trace. This formulation has been established as jointly convex in $\mathbf{W}$ and $\boldsymbol{\Omega}$, allowing for simultaneous optimization of the model parameters and the task covariance matrix.
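When the weights are held fixed, a trace-normalized matrix square root of the Gram matrix of the task weight vectors gives a natural estimate of the task covariance, and this is the form commonly used in alternating schemes for objectives of this type. The sketch below computes it through an SVD. The function name and the feature-by-task layout of `W` are assumptions for illustration, and the snippet should be read as a sketch of that update rather than the authors' implementation.

```python
import numpy as np

def task_covariance_update(W):
    """Return (W^T W)^{1/2} / tr((W^T W)^{1/2}) for W of shape (d, T)."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = (Vt.T * s) @ Vt          # matrix square root of W^T W
    return root / s.sum()           # normalize so that the trace equals one

# Tasks 0 and 1 share identical weights; task 2 is orthogonal to both.
W = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
Omega = task_covariance_update(W)
```

In this toy case the first two tasks receive a covariance equal to their variance (perfect correlation), while the third task is uncorrelated with both.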
In essence, their approach extends the principles of single-task learning with regularization while incorporating alternative optimization techniques to achieve a convex objective function. Further developments have extended this framework to enhance multi-task boosting (zhang2012multi) and multi-label learning (zhang2013multilabel), illustrating its adaptability and potential for a broad spectrum of applications. The approach also offers an interpretative angle from the viewpoint of reproducing kernel Hilbert spaces for vector-valued functions (ciliberto2015learning; jawanpuria2015efficient), showcasing its theoretical elegance and practical utility. Also, in the context of MTL with a considerable number of tasks, it becomes evident that not all tasks are equally interrelated; many display a tendency toward sparsity in their inter-task relationships. Recognizing that a task may not contribute meaningfully to every other task and that sparse task relationships can mitigate overfitting issues more effectively than dense relationships, there is a growing interest in models that can capture these sparse patterns. zhang2017learning pays attention to the elucidation of such sparse task relationships, and the objective function can be written as
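$$\min_{\mathbf{W}, \mathbf{b}, \boldsymbol{\Omega}} \; \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t} \ell\big( y_i^t, f_t(\mathbf{x}_i^t) \big) + \frac{\lambda_1}{2} \mathrm{tr}\big(\mathbf{W}\boldsymbol{\Omega}^{-1}\mathbf{W}^{\top}\big) + \lambda_2 \|\boldsymbol{\Omega}\|_{1} \quad \text{s.t.} \quad \boldsymbol{\Omega} \succeq 0, \tag{37}$$
where $\phi(\cdot)$ corresponds to the feature mapping, and the learning function of the $t$-th task is $f_t(\mathbf{x}) = \mathbf{w}_t^{\top}\phi(\mathbf{x}) + b_t$. By adding an $\ell_1$ regularization on the covariance matrix $\boldsymbol{\Omega}$, their proposed approach, termed the SParse covAriance based mulTi-taSk (SPATS) model, is designed to determine a sparse task covariance structure. The method uses the $\ell_1$ regularization, renowned for promoting sparsity, within a regularization framework tailored for MTL. The convexity of the SPATS objective function facilitates the development of an efficient alternating optimization strategy to find the solution.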
2.1.6. Task Clustering/Grouping
Task relationships can be elucidated through the clustering or grouping of associated tasks, whereby tasks within the same cluster exhibit greater similarities. Executing clustering algorithms at the task level proves particularly advantageous in scenarios with numerous tasks. Typically, task clustering requires leveraging shared structural information across tasks, such as task similarity or distance. These are termed horizontal methods contrasting with hierarchical methods that harness inherent task structures, such as tree formations, to achieve MTL. Task priori sharing and clustering are closely related as both share the commonness across tasks, but clustered structure is an unknown priori that needs to be learned. For example, the problem defined in Eq. (34) could also be equivalent to solving the following optimization problem (See proof in evgeniou2004regularized):
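$$\min_{\{\mathbf{w}_t\}, \{\xi_{it}\}} \; \sum_{t=1}^{T} \sum_{i=1}^{m} \xi_{it} + \rho_1 \sum_{t=1}^{T} \|\mathbf{w}_t\|^2 + \rho_2 \sum_{t=1}^{T} \Big\| \mathbf{w}_t - \frac{1}{T} \sum_{s=1}^{T} \mathbf{w}_s \Big\|^2, \tag{38}$$
where $\mathbf{w}_t = \mathbf{w}_0 + \mathbf{v}_t$ (see Eq. (34)), and $\rho_1$ and $\rho_2$ are positive constants determined by the regularization parameters of Eq. (34). The second regularization term in Eq. (38) pulls every $\mathbf{w}_t$ toward the mean of all task parameters, which implies that all tasks are clustered into a single group whose parameters are constrained to be as similar as possible. In practice, however, related tasks frequently fall into several different groups.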
Horizontal Methods
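Clustered Multi-Task Learning (CMTL) (zhou2011clustered) assumes that the tasks within the same cluster are similar to each other, and it reveals an inherent relationship between ASO (ando2005framework) and CMTL. Specifically, CMTL is non-convex, and the proposed convex relaxation of CMTL is equivalent to an existing convex relaxation of ASO. The objective function of CMTL can be formulated as
$$\min_{\mathbf{W}, \mathbf{F}} \; \mathcal{L}(\mathbf{W}) + \alpha \Big( \mathrm{tr}\big(\mathbf{W}^{\top}\mathbf{W}\big) - \mathrm{tr}\big(\mathbf{F}^{\top}\mathbf{W}^{\top}\mathbf{W}\mathbf{F}\big) \Big) + \beta \, \mathrm{tr}\big(\mathbf{W}^{\top}\mathbf{W}\big) \quad \text{s.t.} \quad F_{t,j} = \begin{cases} 1/\sqrt{n_j}, & t \in \mathcal{C}_j, \\ 0, & \text{otherwise}, \end{cases} \tag{39}$$
where $\mathbf{F}$ is the cluster indicator matrix and $n_j$ is the number of tasks in the $j$-th cluster $\mathcal{C}_j$.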
Hierarchical Methods
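The TAsk Tree (TAT) (han2015learning) model is the first MTL method to learn a tree structure under the regularization framework. By specifying the number of tree layers as $H$, han2015learning use matrix decomposition to learn model weights for each layer, i.e., $\mathbf{W} = \sum_{h=1}^{H} \mathbf{W}_h$. TAT imposes sequential constraints on the distances between task weights in consecutive layers of the tree. Combined with the loss functions, its learning objective is
$$\min_{\{\mathbf{W}_h\}} \; \mathcal{L}(\mathbf{W}) + \sum_{h=1}^{H} \frac{\lambda_h}{T^2} \sum_{i < j} \big\| \mathbf{w}_{h}^{i} - \mathbf{w}_{h}^{j} \big\|_2 \quad \text{s.t.} \quad \big| \mathbf{w}_{h-1}^{i} - \mathbf{w}_{h-1}^{j} \big| \succeq \big| \mathbf{w}_{h}^{i} - \mathbf{w}_{h}^{j} \big|, \;\; \forall h \ge 2, \; \forall i < j, \tag{40}$$
where $\mathbf{w}_{h}^{i}$ is the weight vector of the $i$-th task in the $h$-th layer, the hyperparameters $\lambda_h$ indicate the importance of the different tree layers, and $|\cdot|$ and $\succeq$ denote elementwise operations. This sequential constraint encourages a non-increasing order of the pairwise distances between tasks from bottom to top.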
Model Name | Origin | Year | MTL Strategy | Backbone | Sharing | Modality | Task | Measurement | Loss Function | Availability¹ |
TCDCN | ECCV | zhang2014facial | Early stopping | CNN | Hard | Image | Facial landmark detection/head pose estimation/gender classification/age estimation/expression recognition/facial attribute inference | Mean error (mErr) (burgos2013robust), failure rate (dantone2012real) | Mean squared error (MSE), cross-entropy (CE) loss | Official |
MTL-ML | ACL-IJCNLP | dong2015multi | - | RNN | Hard | Text | Multiple-target language translation | BLEU-4 (papineni2002bleu), Delta | CE loss | - |
Vanilla Cascading | ACL | sogaard-goldberg-2016-deep | Cascading | LSTM | Hard | Text | Part-Of-Speech (POS)/Chunking/Combinatory Categorical Grammar (CCG) Supertagging | F1 score, Micro-F1 score | CE loss | - |
Cross-stitch networks | CVPR | misra2016cross | - | CNN | Soft | Image | Surface normals estimation (normals)/semantic segmentation (semseg), object detection/attribute prediction | mErr/median error (medErr)/within in angular distance (within ), pixel accuracy (pixacc), mIoU, fwIU, mAP | CE loss | Unofficial |
ASP-MTL (aka AdvMTL) | arXiv | liu2017adversarial | Adversarial training | LSTM | Hard & Soft | Text | Text classifications | Error rate | CE loss, adversarial loss, orthogonality constraint | Official |
JMT | EMNLP | hashimoto-jmt:2017:EMNLP2017 | Cascading, adding constraints | LSTM | Soft | Text | Part-Of-Speech (POS) tagging/chunking/parsing/semantic relatedness/textual entailment | Accuracy (acc), F1, MSE, unlabeled attachment score (UAS)/labeled attachment score (LAS) | CE loss, softmax loss, KL-divergence | Unofficial |
MNCs | CVPR | dai2016instance | Cascading | CNN | Hard | Image | Object detection/mask estimation/object categorization | mAPIoU | Mask regression loss, softmax loss | Official |
FAFS | CVPR | lu2017fully | NAS | CNN | Hard | Image | person attribute classification | Acc/recall | CE loss | Official |
MRN | NeurIPS | long2017learning | Task conditioning | CNN | Hard & Soft | Image | classifications on different domains | Acc | CE loss | Official |
PAD-Net | CVPR | xu2018pad | Mutual distillation | CNN | Hard | Image | Depth/scene parsing/contour prediction/normals | rel (eigen2014depth)/RMSE/log10 mErr/acc with threshold (acc-), IoU/acc | CE loss, softmax loss, Euclidean loss | - |
MTN | CVPR | liu2018multi | Adversarial training | CNN | Hard | Image | font/glyph, identity/pose/illumination | Recognition rate | CE loss, adversarial loss | - |
TRL | ECCV | zhang2018joint | cross-task attention | CNN | Hard | Image | Depth estimation (depth)/semseg | rel/RMSE/acc-, pixacc/mean acc/mIoU | berHu loss (laina2016deeper), CE loss, uncertainty loss | - |
MMoE | KDD | ma2018modeling | MoE | MLP | Hard & soft | Tabular data | Income/education/marriage prediction, engagement/satisfaction in recommendation | Area Under the Curve (AUC) | CE loss | Unofficial |
Soft Order | ICLR | meyerson2018beyond | feature fusion | CNN, MLP | Soft | Tabular data, image | Classification, attribute recognition | mErr | CE loss | - |
GREAT4MTL | arXiv | sinha2018gradient | adversarial training | CNN | Hard | Image | classification/colorization/edge/denoised reconstruction, depth/normal/keypoint | Err, RMSE, | CE loss | - |
Sluice networks | AAAI | ruder2019latent | Adding constraints, early stopping | LSTM | Hard & Soft | Text | Chunking/entity recognition (NER)/semantic role labeling (SRL)/POS tagging | Acc | CE loss | Official |
HMTL | AAAI | sanh2019hierarchical | cascading | CNN, LSTM | Hard | Text | NER/Entity Mention Detection (EMD)/Relation Extraction (RE)/Coreference Resolution (CR) | F1 score/precision/recall, MUC/B3/CEAFe (moosavi2016coreference) | CE loss | Unofficial |
DCMTL | AAAI | gong2019deep | cascading | CNN, LSTM | Hard | Text | Segment labeling/Named Entity Labeling (NEL)/slot filling | F1 score/precision/recall | CRF loss, CE loss, ranking loss (vu2016bi) | Official |
NDDR-CNN | CVPR | gao2019nddr | feature fusion | CNN | Soft | Image | Normals/semseg, age estimation/gender classification | mErr/medErr/within , mIoU, pixacc, mean/median absolute error (absErr), acc | CE loss | Official |
PAP | CVPR | zhang2019pattern | cross-task attention | CNN | Hard | Image | Semseg/depth/normals | RMSE/rel/acc with , mErr/medError/within , mIoU/mean accuracy (mAcc)/pixacc | CE loss, loss, berHu loss, affinity loss (zhang2019pattern) | - |
MTN (& DWA) | CVPR | liu2019end | Adaptive weighting | CNN | Hard & Soft | Image | Semseg/depth/normals, 10 classifications (visual domain decathlon²) | mIoU/pixacc, mErr/medErr/within , absErr/real error, accuracy | CE loss, loss, dot product | Official |
ASTMT | CVPR | maninis2019attentive | attention, single-tasking | CNN | Hard | Image | Semseg/depth/edge/normals/human parts/saliency estimation/albedo | mIoU/osdF/mErr/maximum F-measure (maxF)/RMSE/ | CE loss, loss | Official |
ML-GCN | CVPR | chen2019multi | Graph based | CNN, GCN | Hard | Image | Multi-label recognition | precision, recall, F1 | CE loss | Official |
RD4MTL | arXiv | meng2019representation | Adversarial training | CNN | Hard | Image | Classifications | Acc | CE loss, adversarial loss | Official |
MTL-NAS | CVPR | gao2020mtl | NAS | CNN | Adaptive | Image | Semseg/normals, object classification/scene classification | mErr/medErr/Within , mIoU/pixacc, Acc | CE loss, loss | Official |
BMTN | BMVC | vandenhende2019branched | NAS | CNN | Adaptive | Image | Semseg/edge/depth/keypoint detection (point), attribute classification | mIoU, pixacc, , Acc | CE loss, loss, | Official |
PSD | CVPR | zhou2020pattern | Distillation | CNN | Hard & soft | Image | Semseg/depth/normals | RMSE/rel/acc with , mIoU/mean accuracy/pixacc, mErr/medErr/within | CE loss, loss, berHu loss | - |
KD4MTL | ECCV Workshop | li2020knowledge | knowledge distillation | CNN | Hard & soft | Image | Semseg/depth/normals, classification | mIoU/pixacc, absErr/rel, mErr/medErr/within , Acc | CE loss, loss, dot product | Official |
MTI-Net | ECCV | vandenhende2020mti | multi-task distillation | CNN | Hard & Soft | Image | Semseg/depth/edges detection (edges)/normals/saliency estimation/human parts | mIoU, RMSE, mErr, optimal dataset-scale F-measure (odsF) (martin2004learning), | CE loss, loss | Official |
LTB | ICML | guo2020learning | NAS, task grouping | CNN | Soft | Image | Regression, face attribute prediction, semseg/normals/depth/keypoints/edges | Acc, CE, cos, mean absErr | CE loss, loss, cosine loss | - |
AAMTRL | ICML | mao2020adaptive | adversarial training | CNN & LSTM | Hard | Text | Classifications | Relatedness evolution, acc, influence of #task | Any 1-Lipschitz loss | - |
CGC & PLE | RecSys | tang2020progressive | MoE | MLP | Hard & soft | Tabular data | Sub-tasks in the recommendation systems, income/education/marriage prediction | AUC/MSE, MTL gain | CE loss, loss | Unofficial |
TSNs | ICCV | sun2021task | task relationship learning, task conditioning | CNN | Hard | Image | Semseg/depth/edges/normals/saliency estimation/human parts | mIoU, RMSE, mErr, odsF, | CE loss, loss | Official |
MuST | ICCV | ghiasi2021multi | knowledge distillation, task conditioning | CNN | Hard | Image | Classification/detection/semseg/depth/normals | Acc, mIoU, RMSE, odsF | CE loss, loss | - |
AuxSegNet | ICCV | xu2021leveraging | cross-task attention | CNN | Hard & Soft | Image | Semseg/classification/saliency detection | mIoU/precision/recall | Multi-label softmax loss, CE loss | Official |
ATRC | ICCV | bruggemann2021exploring | cross-task attention | CNN | Hard & soft | Image | Semseg/depth estimation/edges/normals/saliency estimation/human parts | mIoU, RMSE, mErr, odsF, maxF, | CE loss, loss | Official |
DSelect-k | NeurIPS | hazimeh2021dselect | MoE | MLP, CNN | Hard & soft | Tabular data, Image | engagement/satisfaction task, classification | Total loss, Acc, AUC/RMSE, #expert | CE loss, loss | Official |
MT-TaG | ArXiv | gupta2022sparsely | MoE | Transformer | Hard & soft | Text | 16 Language understanding tasks, e.g. textual entailment, sentiment classification, etc. | Acc, Spearman correlation (spearman1961proof), Matthews correlation coefficient (matthews1975comparison) | CE loss, MSE | - |
CrossDistil | AAAI | yang2022cross | distillation | MLP | Hard & soft | Tabular data | Finish watching/like | AUC, multi-AUC (hand2001simple) | CE loss | - |
MulT | CVPR | bhattacharjee2022mult | cross-task attention | CNN & Transformer | Hard | Image | Semseg/depth/reshading/normals/keypoints/edges | MTL gain, mErr of domain generalization | CE loss, loss, rotate loss (zamir2018taskonomy) | Official |
MTFormer | ECCV | xu2022mtformer | cross-task attention, task balancing (spec., kendall2018multi) | Transformer | Hard & soft | Image | Semseg/depth/saliency detection/human parts | mIoU, RMSE, | CE loss, loss, cross-task contrastive loss, uncertainty loss | - |
MQTransformer | arXiv | xu2022multi | cross-task attention | Transformer | Hard & Soft | Image | Semseg/depth/edges/normals/saliency estimation/human parts | mIoU, RMSE, mErr, odsF, maxF | CE loss, loss | - |
MetaLink | ICLR | cao2022relational | Graph based | MLP, GNN | Hard | Image, Graph | Classification | mAP, ROC AUC | CE loss | Official |
DeMT | AAAI | zhang2023demt | cross-task attention | CNN & Transformer | Hard & Soft | Image | Semseg/depth/edges/normals/saliency estimation/human parts | mIoU, RMSE, mErr, odsF, maxF, | CE loss, loss | Official |
mTEB | WACV | lopes2023cross | cross-task attention | CNN | Hard & soft | Image | Semseg/depth/normals/edges | , mIoU, RMSE, mErr, F1 | CE loss, berHu loss, cosine loss (guizilini2021geometric) | Official |
OKD-MTL | WACV | jacob2023online | distillation, task weighting | Transformer | Hard & Soft | Image | Semseg/depth/normals | , mIoU/pixacc, absErr/rel, mErr/medErr/within | Adaptive feature distillation loss, CE loss, loss, cosine loss | - |
AdaMV-MoE | ICCV | chen2023adamv | MoE | Transformer | Hard & soft | Image | classification/detection/Seg | Acc, Average Precision (AP) | CE loss | Official |
¹ This column provides the link to the implementation or execution. Click on "Official" or "Unofficial" to access the website.
² Part of PASCAL in Detail Workshop Challenge, CVPR 2017, July 26th, Honolulu, Hawaii, USA. https://www.robots.ox.ac.uk/vgg/decathlon.
³ We use "state" here to represent the domain of reinforcement learning, including the observations of states of environment, the positions of object, the actions made by agent, etc.
⁴ The average rank of MTL on all different tasks. MR = 1 if a method ranks first across all tasks.
2.2. DL Era: Effective and Diversified
With the advent of DL, more powerful computational units with higher memory bandwidth, e.g., Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), have made it possible to learn richer features for challenging tasks. Deep MTL methods, unlike traditional MTL methods that impose parameter regularizations or decompositions, can handle large-scale parameter sharing, feature propagation, NAS, task balancing, and optimization intervention, to name a few. Traditional techniques often involve complicated mathematical analysis yet fail to achieve satisfactory performance in real-world scenarios with noise-polluted data or loosely related tasks. Deep MTL methods can overcome these issues by (1) directly extracting features from raw data and gradually elevating them layer by layer from low-level textures to mid-level semantics to high-level responses; and (2) progressively learning activations by stochastic gradient descent (SGD) (robbins1951stochastic; lecun2002efficient), which is provably efficient and practical for obtaining expressive networks (livni2014computational). In this manner, hierarchical features can be efficiently communicated at different levels for the joint learning of multi-task objectives.
This section begins with a discussion of the architecture taxonomy commonly adopted in deep MTL, which serves as the backbone for the rest of the method overview. We then summarize the feature propagation techniques, which include feature fusion (see § 2.2.1), cascading (see § 2.2.2), distillation (see § 2.2.3), and cross-task attention (see § 2.2.4). These techniques encourage networks to automatically combine the features learned from different tasks, addressing the crucial challenge of effectively and efficiently utilizing the rich features enabled by DL. § 2.2.5 presents an overview of task balancing techniques in deep MTL, which linearly combine different tasks according to three essential factors: gradient, loss, and learning speed. Comparing and recalibrating these factors aims to coordinate diverse tasks during the model weight update process. We discuss this topic from the perspectives of gradient correction and dynamic weighting. In contrast, § 2.2.6 explores MOO in the context of MTL, which aims to simultaneously optimize potentially conflicting objective functions. Other promising topics covered include adversarial multi-task training (see § 2.2.7), MoE (see § 2.2.8), GCN-based MTL (see § 2.2.9), and NAS for MTL (see § 2.2.10). The summary of deep MTL models is presented in Table 5, and representative DL frameworks in MTL are illustrated in Fig. 8.
Architecture Taxonomy
The remarkable success of deep MTL can be attributed to the rich extracted representations and their efficient sharing. Multi-task sharing depends on how the architecture is split among the involved tasks. liu2016recurrent first discuss three sharing mechanisms for text classification with Recurrent Neural Networks (RNNs): uniform-, coupled-, and shared-layer architectures. ruder2017overview first organizes sharing mechanisms into two categories: hard parameter sharing and soft parameter sharing. According to this taxonomy, the uniform-layer architecture falls under hard parameter sharing, while the coupled- and shared-layer architectures are considered soft parameter sharing. In general, ruder2017overview’s taxonomy has been widely accepted by the research community (vandenhende2021multi). We carry forward this taxonomy and enrich it with more details.
In hard parameter sharing, as shown in Fig. 8(a), different tasks share identical parameters in the shallow layers and maintain their own parameters in the task-specific heads. As shown in Fig. 6(a), this idea dates back to the 1990s (bromley1993signature; caruanamultitask; caruana1997multitask), when highly related tasks were introduced into a shared FNN to serve as inductive bias for each other. Fig. 6(b) shows a modern use of this idea in RNNs (dong2015multi). CNNs can also adopt hard parameter sharing to perform multiple related tasks. As shown in Fig. 10, TCDCN (zhang2014facial) and Fast RCNN (girshick2014rich; girshick2015fast) are the earliest practices of this idea in computer vision. From a representation learning perspective, the shallow layers are typically shared as a feature encoder that extracts common features such as edges and textures. By enriching these common features with more related tasks, the deeper layers can enable multitasking in the task-specific heads.
misra2016cross argue that there is no principled way of splitting the architecture in hard parameter sharing, and conduct the first empirical study of the performance trade-offs among various combinations of involved tasks and splitting choices in CNNs. The dependence between the involved tasks and the way the architecture is split motivates the exploration of an architecture that can capture all possible splittings and thus learn an optimal combination of task-shared and task-specific representations, i.e., soft parameter sharing, shown in Fig. 8(b). While hard parameter sharing requires the shallow layers to be identical across tasks, soft parameter sharing lets each task maintain its own shallow layers and leverage features from related tasks during propagation to capture similarities. These feature propagation techniques include, but are not limited to, fusion, aggregation, and attention. However, whether employing hard or soft parameter sharing, exploring the MTL architecture space remains error-prone. First, the space of deep neural architectures grows exponentially with depth, and incorporating more tasks significantly expands the range of optimal solutions. Second, hard parameter sharing compresses the model size but can lead to a sub-optimal solution, whereas soft parameter sharing secures performance by maintaining the maximum total model size, allowing each task to learn its own architecture. An adaptive architecture search performed greedily during neural network training shows promise. In adaptive parameter sharing, shown in Fig. 8(c), every path between the layers of different tasks is active before training. Connections vanish in the pursuit of model compression during multi-task optimization, and a thin network is usually finalized after this dynamic branching procedure.
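To make the hard-sharing pattern concrete, the following is a minimal sketch rather than any surveyed model. It assumes a toy one-layer shared encoder and two linear heads; the names `W_SHARED`, `HEADS`, `task_a`, and `task_b` are hypothetical and chosen only for illustration.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


# Shared encoder parameters: identical for every task (hard sharing).
W_SHARED = [[1.0, -1.0], [0.5, 0.5]]

# Task-specific heads: one parameter set per task (hypothetical tasks).
HEADS = {
    "task_a": [[1.0, 0.0]],
    "task_b": [[0.0, 2.0]],
}


def forward(x):
    """Run the shared encoder once, then every task-specific head."""
    shared = relu(linear(W_SHARED, x))
    return {task: linear(w, shared) for task, w in HEADS.items()}


outputs = forward([2.0, 1.0])
```

In this sketch a gradient step on either head would also update `W_SHARED`, which is how related tasks act as an inductive bias for each other; soft sharing would instead keep one encoder per task and exchange features between them.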
Notation | Description |
Batch size. | |
Learning rate. | |
Feature maps output from -th layer of -th task, where are (batch size,) #height, #width, and #channel. | |
Convolution filter, where denotes the size of filter, and denote the number of input and output channels, respectively. | |
Exponential function. | |
Sigmoid function, where . | |
Softmax function, where for any entry index . | |
An arbitrary similarity function, e.g. cosine similarity cos. | |
The element-wise dot product. | |
Layer norm. | |
Multi-head self-attention operator. | |
Convolution operation parametrized by . | |
Reshape operation to rearrange the original feature maps in space into a new space. |
Unless explicitly stated otherwise, we employ the notation provided in Tab. 6 within the context of DL settings to expand upon and complement the information presented in Tab. 3.
2.2.1. Feature fusion
Feature fusion is a common technique in MTL that fuses features extracted under the supervision of different tasks, thereby leveraging shared and private knowledge across tasks. This technique allows each network to better exploit the relationships between tasks and thus improve overall performance. In general, feature fusion in MTL involves weighted summation, concatenation, or a combination of both. We categorize the feature fusion methods into two classes: parallel sharing, where feature fusion happens at the same layer position across tasks, and non-parallel sharing, in which the shared layers may be permuted. The representative works in the line of parallel sharing include Cross-Stitch Networks (misra2016cross), Sluice Networks (ruder2019latent), and Neural Discriminative Dimensionality Reduction in Convolutional Neural Networks (NDDR-CNN) (gao2019nddr). As research in this direction progresses, an increasing number of learnable parameters are being used to control the fusion process. For example, Cross-Stitch Networks utilize four task-aware parameters, Sluice Networks capture latent subspaces of features via extra parameters, and NDDR-CNN models layer-wise fusion by using convolutions. However, expecting task feature hierarchies to align perfectly, even among closely related tasks, is unreasonable. Imposing parallel sharing on such unmatched layers could lead to negative transfer. To remedy this dilemma, Soft Order (meyerson2018beyond) uses a more flexible ordering of shared layers to assemble them in different ways for different tasks.
Parallel sharing. Cross-Stitch Networks (misra2016cross) is a soft parameter-sharing architecture that learns an optimal combination of task-shared and task-specific representations via four learnable parameters, which together form a cross-stitch unit. As shown in Fig. 7(a), the activations from different tasks are linearly combined via four parameters . We denote by the feature maps in the -th layer of task . Then the formalization of the Cross-Stitch unit is
(41) |
Specifically, the extreme setting of can make certain layers non-sharing. From this perspective, separate STL is a special case of cross-stitch combinations. By varying and values, the unit can move between task-shared and task-specific representations, and even choose a middle ground if necessary.
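The linear combination performed by a cross-stitch unit can be sketched as follows for two tasks. This is an illustrative sketch under assumptions: the 2x2 mixing matrix `alpha` and the task names A and B are notation introduced here, and activations are flattened to plain lists.

```python
def cross_stitch(x_a, x_b, alpha):
    """Linearly mix the activations of two tasks.

    alpha is a 2x2 matrix [[a_AA, a_AB], [a_BA, a_BB]] (assumed notation):
    a_AA and a_BB weight a task's own activations, while a_AB and a_BA
    weight the activations coming from the other task.
    """
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    out_a = [a_aa * u + a_ab * v for u, v in zip(x_a, x_b)]
    out_b = [a_ba * u + a_bb * v for u, v in zip(x_a, x_b)]
    return out_a, out_b


x_a, x_b = [1.0, 2.0], [3.0, 4.0]

# Identity mixing: no sharing, which recovers two separate STL networks.
no_share = cross_stitch(x_a, x_b, [[1.0, 0.0], [0.0, 1.0]])

# Uniform mixing: both tasks receive the same averaged representation.
mixed = cross_stitch(x_a, x_b, [[0.5, 0.5], [0.5, 0.5]])
```

In a trained network the four entries of `alpha` would be learned per layer by backpropagation; the two fixed settings above only show the extremes between which the unit can move.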
Sluice Networks (ruder2019latent) learns shared parameters between two BiLSTM-based sequence labeling networks (plank2016multilingual). This work aims to model loosely related tasks with non-overlapping datasets. Fig. 7(b) shows a sluice meta-network with two tasks, in which each layer is partitioned into two orthogonal subspaces and . Accordingly, the activations in the -th layer of task are also partitioned into and , thus leading to a matrix in to combine activations from two tasks:
(42) |
Inspired by Cross-stitch networks, these values are learnable to control how much to share for task-shared information and how much to preserve for task-specific information. Finally, parameter (see Fig. 7(b)), through the skip-connections, linearly summarizes the multi-task representations at various levels of the network architecture.
Neural Discriminative Dimensionality Reduction in Convolutional Neural Networks (NDDR-CNN) (gao2019nddr) further concatenates feature maps from different tasks in a channel-wise manner. This NDDR layer, as shown in Fig. 7(c), can be implemented with a simple convolutional layer followed by a batch normalization layer, and can be inserted into any end-to-end trainable CNN in a “plug-and-play” fashion. Considering the number of tasks being , we can denote convolution by , where is the depth of combined feature maps from all tasks. We concatenate feature maps according to the channel dimension and divide convolution according to the output dimension by tasks as follows:
where and . Then, the output feature maps at the -th layer for the -th task can be calculated as
(43) |
The NDDR layer defined by Eq. (43) is a standard convolution operation in CNNs. To avoid a trivial solution on and the noise directions of learned features, a batch normalization layer follows each NDDR layer, and weight decay is applied to the weights of the NDDR layer.
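The concatenate-then-convolve idea can be sketched as below. The sketch assumes that the fusing convolution acts per pixel across channels (a pointwise convolution), so it reduces to one linear map per pixel; batch normalization and weight decay are omitted, and the task names `seg` and `depth` are hypothetical.

```python
def nddr_layer(features, weights):
    """Fuse per-task feature maps by channel-wise concatenation.

    features: dict task -> list of pixels, each pixel a list of C channels.
    weights:  dict task -> C x (T*C) matrix, i.e. the slice of the fusing
              convolution that produces this task's output channels.
    Batch normalization is intentionally left out of this sketch.
    """
    tasks = list(features)
    n_pixels = len(features[tasks[0]])
    # Concatenate the channels of all tasks at every pixel.
    concat = [sum((features[t][p] for t in tasks), []) for p in range(n_pixels)]
    fused = {}
    for t in tasks:
        fused[t] = [
            [sum(w * c for w, c in zip(row, pixel)) for row in weights[t]]
            for pixel in concat
        ]
    return fused


features = {"seg": [[1.0, 2.0]], "depth": [[3.0, 4.0]]}
weights = {
    # "seg" keeps only its own channels (no sharing).
    "seg": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    # "depth" averages its own channels with those of "seg".
    "depth": [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]],
}
fused = nddr_layer(features, weights)
```

Initializing each task's weight slice close to the identity on its own channels, as done for `seg` here, starts training from the unshared single-task networks.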
Non-parallel sharing. Soft Order (meyerson2018beyond) learns how shared layers are assembled in permuted ways for different tasks. Specifically, a learnable tensor of scalars , is used to implement the soft ordering, where is the number of layers and is the number of tasks. For simplicity, consider a hard sharing network with shared layers ( can be or Linear function), then the soft ordering of this hard sharing for the -th task is:
(44) |
where is the -th entry of the tensor . Fig. 7(d) visualizes this layer permutation operation. Note that the constraint on for can be easily implemented via a softmax function. In practice, a dropout operation helps increase the generalization capacity of shared representations.
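A minimal sketch of soft ordering for a single task is given below. It assumes that, at every depth, the output is a softmax-weighted sum of all shared layers applied to the previous output; the names `logits` and `layers` are hypothetical, and the two toy layers stand in for learned modules.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def soft_order_forward(x, layers, logits):
    """Apply shared layers in a softly learned order for one task.

    layers: list of callables mapping a vector to a same-sized vector.
    logits: logits[k][j] is the unnormalized weight of layer j at depth k;
            the softmax enforces that the weights at each depth sum to one.
    """
    y = x
    for depth_logits in logits:
        s = softmax(depth_logits)
        outs = [layer(y) for layer in layers]
        y = [sum(s_j * o[i] for s_j, o in zip(s, outs)) for i in range(len(y))]
    return y


layers = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]

# Sharp logits approximate a hard order: task A applies layer 0 then layer 1,
# task B applies the same two shared layers in the reverse order.
out_a = soft_order_forward([1.0], layers, [[50.0, -50.0], [-50.0, 50.0]])
out_b = soft_order_forward([1.0], layers, [[-50.0, 50.0], [50.0, -50.0]])

# Uniform logits blend both layers equally at every depth.
out_uniform = soft_order_forward([1.0], layers, [[0.0, 0.0], [0.0, 0.0]])
```

The two tasks reuse exactly the same layers yet obtain different outputs, which is the flexibility that fixed parallel sharing lacks.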
2.2.2. Cascading
Supervising all tasks at the outermost level has been shown to be sub-optimal. Another avenue of investigation for mitigating the limitations of parallel sharing is multi-task cascaded learning (sogaard-goldberg-2016-deep). This field of study involves supervising tasks at different levels within their respective layers, allowing higher-level tasks to effectively leverage the shared representation derived from lower-level tasks. In practice, multi-task cascading can be applied to 1) a complicated task that can be decomposed into several sub-tasks, e.g., instance-aware semantic segmentation decomposed into differentiating instances, estimating masks, and categorizing objects in CV (dai2016instance), and 2) a group of hierarchical tasks, e.g., part-of-speech (POS) tagging (word-level), dependency parsing (syntactic-level), and question answering (QA) (semantic-level) in NLP (sogaard-goldberg-2016-deep; hashimoto-jmt:2017:EMNLP2017). In this line of research, early work (sogaard-goldberg-2016-deep) realizes cascading by supervising low-level tasks at shallow layers and then reusing the representations from shallow layers for higher-level tasks. The Joint Many-Task (JMT) model (hashimoto-jmt:2017:EMNLP2017) adds shortcut connections from each lower-level task prediction to higher-level tasks, which can further reflect task hierarchies. Furthermore, shortcut connections in Multi-task Network Cascades (MNCs) (dai2016instance) and Deep Cascade Multi-Task Learning (DCMTL) (gong2019deep) come from both cascade connections (predictions) and residual connections (features). Hierarchical MTL (HMTL) (sanh2019hierarchical) introduces more semantic tasks to share both common embeddings and encoders in a hierarchical cascading architecture.
Vanilla Cascading (sogaard-goldberg-2016-deep) first presents a multi-task learning architecture that utilizes bi-directional RNNs. This architecture enables the supervision of different tasks at various layers, as shown in Fig. 10(a). In this study, the POS task is supervised at the innermost layer, and syntactic chunking and Combinatory Categorial Grammar (CCG) supertagging join at the outermost layer to utilize the shared representation of the lower-level tasks via hard parameter sharing. In this case, the lower-level task supervision affects the parameter updates of the shallow layers, which is beneficial to all tasks involved in MTL.
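Supervising tasks at different depths can be sketched as follows. This is only a schematic under assumptions: the layers are toy functions instead of bi-directional RNNs, and the heads `pos` and `chunk` are hypothetical stand-ins for task classifiers.

```python
def run_cascade(x, layers, supervision):
    """Run a layer stack and read each task out at its own depth.

    layers:      list of callables applied in sequence.
    supervision: dict task -> (layer_index, head); the head reads the
                 hidden state produced by the layer at layer_index.
    """
    hidden = []
    h = x
    for layer in layers:
        h = layer(h)
        hidden.append(h)
    return {task: head(hidden[idx]) for task, (idx, head) in supervision.items()}


layers = [
    lambda v: [a + 1.0 for a in v],   # shallow layer
    lambda v: [a * 2.0 for a in v],   # deeper layer
]
supervision = {
    "pos": (0, sum),     # low-level task supervised at the shallow layer
    "chunk": (1, max),   # higher-level task supervised at the outer layer
}
predictions = run_cascade([1.0, 2.0], layers, supervision)
```

Because the `chunk` head reads a state computed on top of the layer that `pos` supervises, a loss on `pos` would shape the representation that the higher-level task reuses.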
Multi-task Network Cascades (MNCs) (dai2016instance) performs three sub-tasks of the instance-aware semantic segmentation at the different stages and reuses the features of these tasks at different layers. Each of the three stages involves its own predictions of box-level instance proposals, mask-level instance regression, and instance categorization, respectively, and the later task learning relies on previous prediction output. As shown in Fig. 10(b), the innermost features are utilized by all sub-tasks, which is beneficial to both the accuracy and speed in an end-to-end training manner.
Joint Many-Task (JMT) Model (hashimoto-jmt:2017:EMNLP2017) is another cascading model to predict NLP tasks with different linguistic levels of morphology, syntax, and semantics. JMT shares a similar architecture with MNCs, as shown in Fig. 10(c), but each higher-level task contains the shortcut connections from the predictions of all lower-level tasks. In addition, the naïve regularization term is imposed on model weights to allow the improvement of one task without exhibiting catastrophic interference with the other tasks.
Deep Cascade Multi-Task Learning (DCMTL) (gong2019deep) first incorporates both cascade and residual connections. As shown in Fig. 10(d), the cascade connections transmit predictions from lower tasks, while the residual connections transmit inputs from lower layers. It has been validated that these skip connections are effective for strictly ordered tasks, whereas the cascading structure alone proves inadequate for high-level tasks that heavily rely on low-level tasks. In addition, DCMTL can outperform previous SOTA methods and has been deployed on the online shopping assistant of a dominant Chinese E-commerce platform.
Hierarchical Multi-Task Learning (HMTL) (sanh2019hierarchical) is a parallel method trained in a hierarchical fashion. This model supervises a set of low-level tasks at the bottom layers and more complex tasks at the top layers. Similar to MNCs (dai2016instance), representations extracted at the very beginning are fed into all the successive encoders for different tasks, which benefits training stability and speed. Also shown in Fig. 10(d), HMTL is a variation in which parallel high-level tasks can exist, e.g., Coreference Resolution (CR) and Relation Extraction (RE), and more types of word representations, such as pre-trained GloVe (pennington2014glove) and ELMo (peters-etal-2018-deep) embeddings, are combined to achieve the best performance.
2.2.3. Knowledge Distillation (KD)
Motivated by KD (44873), where a teacher model guides a student model by passing meaningful knowledge (e.g., soft labels), separate models for different tasks in MTL can also exploit such knowledge. Specifically, a teacher model can be trained on multiple tasks of interest and then serve as an expert that performs those tasks and possesses versatile knowledge. The knowledge from the teacher model is then transferred to a student model. This can be done by training the student model to mimic the behavior of the teacher model, e.g., the student model learns to predict the outputs or pattern structures of the teacher model on the shared tasks. Alternatively, the student model can be trained jointly on multiple tasks, using both the labeled data for each task and the guidance from the teacher model. The shared information and generalizable representations learned from the teacher model can benefit the student model’s performance on all the tasks. In this manner, the teacher model performs auxiliary tasks to assist the student model in target tasks. For example, the depth prediction from a customized CNN can help the segmentation task via multi-modal distillation (i.e., training with RGB-Depth data instead of RGB data), where depth prediction is an intermediate auxiliary task for the target segmentation task (xu2018pad). The research in this subfield can be classified into two categories that correspond to the knowledge encompassed within a teacher model: feature-level and response-level. At the feature level, KD4MTL (li2020knowledge) carries forward FitNets (romero2014fitnets) by minimizing the distance between the features of the offline task-specific networks and the online multi-task network, and OKD-MTL (jacob2023online) distills intermediate features from single-task teachers in an online manner. At the response level, MuST (ghiasi2021multi) pretrains several specialized teachers that generate multi-task pseudo labels for the target dataset, which are then used to train a general-purpose student, and CrossDistil (yang2022cross) distills the responses of item preference across different tasks in the recommender system.
Feature-Level. Knowledge Distillation for Multi-task Learning (KD4MTL) (li2020knowledge), as shown in Fig. 7(e), first trains an offline task-specific network for each task, and then learns the multi-task network by adding a loss that minimizes the distance between the features of the task-specific networks and those of the multi-task network. Since the multi-task network must handle multiple tasks while each task-specific network specializes in its own task, the two output features cannot be completely matched. Instead, the feature map from the multi-task network, denoted by , is transformed via an adaptor . These adaptors are jointly learned with the multi-task network via the loss function defined as
(45) |
where is the feature map from an offline single network corresponding to the task , and is defined as the Euclidean distance between the two feature maps that is normalized.
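The adaptor-based feature matching described above can be sketched as follows. The sketch rests on assumptions: feature maps are flattened to vectors, the adaptor is a plain linear map, normalization is L2, and the distance is squared; the function names are hypothetical.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean norm (eps guards against zero norm)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / max(norm, eps) for a in v]


def adaptor(weights, v):
    """Linear adaptor applied to the multi-task feature (assumed form)."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def distill_loss(mtl_feat, stl_feat, adaptor_weights):
    """Squared Euclidean distance between the normalized, adapted multi-task
    feature and the normalized feature of the frozen single-task network."""
    a = l2_normalize(adaptor(adaptor_weights, mtl_feat))
    b = l2_normalize(stl_feat)
    return sum((x - y) ** 2 for x, y in zip(a, b))


identity = [[1.0, 0.0], [0.0, 1.0]]

# Same direction, different scale: normalization removes the scale gap.
loss_aligned = distill_loss([3.0, 4.0], [6.0, 8.0], identity)

# Orthogonal features: the largest mismatch for non-negative features.
loss_orthogonal = distill_loss([1.0, 0.0], [0.0, 1.0], identity)
```

Only the adaptor and the multi-task network would receive gradients from this loss; the single-task feature acts as a fixed target.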
Online KD for MTL (OKD-MTL) (jacob2023online) proposes an online knowledge distillation method to mitigate negative transfer across tasks. The adaptive feature distillation (AFD) loss with an online task weighting (OTW) scheme is designed to selectively train layers for each task. As shown in Fig. 7(g), the critical component AFD is an online weighted knowledge distillation performed on intermediate features from the shared ViT backbone of MTL, and the distilled features are from the teacher model that performs STL on each task. We denote by the total number of layers of the ViT encoder backbone and let denote the number of tasks. Then the AFD loss is defined as
(46) |
where denotes the learnable parameters for the -th task in the -th layer, which balances the multiple tasks. is the shared features learned from the teacher model at -th layer. The shared features can be distilled for each task features through Eq. (46) above. In the framework of OKD-MTL, the STL teacher and MTL students are trained in an end-to-end manner through the total loss
(47) |
To mitigate the gap between the MTL and STL losses, OTW adjusts the task weight for the -th task at iteration as follows:
(48) |
where serves as the temperature hyperparameter to control this task weighting process, and represents the iteration index.
Response-Level. Multi-Task Self-Training (MuST) (ghiasi2021multi) first trains the classification, detection, and segmentation teacher models from scratch on ImageNet (deng2009imagenet; russakovsky2015imagenet)/JFT-300M (sun2017revisiting), Objects365 (shao2019objects365), and COCO (kirillov2019panoptic), respectively (pre-trained checkpoints are also recommended to alleviate computational burdens). The knowledge is then transferred from these specialized teachers to a general-purpose student model via pseudo-labeling. Fig. 7(f) shows an overview of MuST: every image in the shared dataset has supervision for all tasks, from either supervised labels or pseudo labels. Balancing these loss functions is tricky (see § 2.2.5), and MuST adopts (goyal2017accurate) for ImageNet experiments, where denotes the batch size, denotes the learning rate, the superscript indicates the student or teacher, and the total loss of MTL is defined as . For JFT-300M, the algorithm in kendall2018multi was used to learn for each task. For depth loss, the weight was chosen by a parameter sweep. It has been validated that MuST can both rival supervised STL and enhance transfer learning performance.
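The pseudo-labeling step can be sketched as below. This is a schematic under assumptions: teachers are arbitrary callables, an image is any Python object, and the function name `build_multitask_labels` as well as the toy teachers are hypothetical.

```python
def build_multitask_labels(image, ground_truth, teachers):
    """Give one image a label for every task.

    ground_truth: dict task -> label for the tasks this dataset annotates.
    teachers:     dict task -> callable returning a pseudo label.
    A task keeps its supervised label when one exists and otherwise
    receives the pseudo label produced by its specialized teacher.
    """
    labels = {}
    for task, teacher in teachers.items():
        if task in ground_truth:
            labels[task] = ("supervised", ground_truth[task])
        else:
            labels[task] = ("pseudo", teacher(image))
    return labels


# Toy teachers standing in for trained classification and depth models.
teachers = {
    "classification": lambda img: "cat",
    "depth": lambda img: [0.5] * len(img),
}

# An image from a classification dataset: only the class is annotated.
labels = build_multitask_labels([0.1, 0.2], {"classification": "dog"}, teachers)
```

The student is then trained on the union of such labeled images, so that every training example contributes a loss term for every task.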
CrossDistil (Cross-Task Knowledge Distillation) (yang2022cross) proposes a recommender framework that can transfer fine-grained ranking knowledge about a user’s preference towards items, as shown in Fig. 7(h). To facilitate fine-grained ranking, the training samples are divided into multiple subsets, taking into account all possible combinations of the tasks. For instance, in a recommender system where two tasks involve predicting “Buy” and “Like” for an item, the potential task combinations include “Buy:1, Like:1”, “Buy:1, Like:0”, “Buy:0, Like:1”, and “Buy:0, Like:0”. For simplicity, the division into multiple subsets for two tasks is:
(49) |
where represents the input feature vector from the whole dataset .
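The division of samples by label combination can be sketched as follows, assuming binary labels for two tasks; the function name and the sample identifiers are hypothetical.

```python
def partition_by_labels(samples):
    """Split samples into the four label combinations of two binary tasks.

    samples: iterable of (x, y_buy, y_like) with labels in {0, 1}.
    Returns a dict keyed by (y_buy, y_like).
    """
    subsets = {(1, 1): [], (1, 0): [], (0, 1): [], (0, 0): []}
    for x, y_buy, y_like in samples:
        subsets[(y_buy, y_like)].append(x)
    return subsets


samples = [
    ("item_1", 1, 1),
    ("item_2", 1, 0),
    ("item_3", 0, 0),
    ("item_4", 0, 1),
    ("item_5", 0, 0),
]
subsets = partition_by_labels(samples)
```

A multipartite order over these subsets then lets each task rank, for example, items that are both bought and liked above items that are only liked, which a bipartite split on a single label cannot express.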
We denote by and so forth. The fine-grained ranking considers the corresponding multipartite order instead of bipartite orders, e.g., order , which may be contradictory among different tasks. Based on the fine-grained ranking, an augmented loss is introduced for each task as
(50) |
where and are two hyper-parameters to balance the importance of pair-wise ranking relations and is the logit value before the sigmoid function . Additionally, and so forth. In contrast, the original regression-based loss function for each task is
(51) |
Based on Eqs. (50) and (51), CrossDistil regards the models trained with the augmented loss as teachers and the models trained with the regression-based loss as students. The distillation loss for each task is
(52) |
where is learned and calibrated from Eq. (50), and an error correction mechanism is applied to ensure its alignment with the hard label . The original regression loss and knowledge distillation loss contribute to the learning of students for multiple tasks as
(53) |
where is a hyper-parameter to balance two loss functions. In this manner, by distilling the fine-grained ranking of task combinations, cross-task knowledge is effectively transferred.
2.2.4. Cross-Task Attention
Attention mechanism (niu2021review; brauwers2021general; guo2022attention) has been one of the most crucial concepts in RNNs, CNNs, and Transformers over the past decade in DL. Generally, attention is an information aggregation technique inspired by a human recognition system that tends to prioritize part of local regions over others when processing rich information. Under MTL settings, features from different tasks are more abundant than in STL, thus leading to a natural integration of the attention mechanism. Cross-task attention (bruggemann2021exploring), encoding task-aware features into cross-task queries, can perform task-association via refinement of multi-source features. Unlike feature fusion methods (misra2016cross; ruder2019latent; gao2019nddr) that propagate task-shared information among different task-specific branches, cross-task attention calculates what/how to share based on cross-task comparison between source tasks and target task. Considering the "morphological" aspect, the hard compartmentalization effect caused by a block-structured communication matrix in feature fusion methods could preserve the interference of features in some cases for tasks. This dilemma could be alleviated with a soft, learnable form of task-aware feature attention. Early works (xu2018pad; liu2019end; zhang2019pattern; zhou2020pattern; bruggemann2021exploring) build naïve attention modules (e.g., sigmoid function or inner product) to refine feature affinity or capture relational contexts across tasks, and then locate/diffuse features according to the attention map. PAD-Net (xu2018pad) and MTAN (liu2019end) select attentive features via an attention mask after the sigmoid activation. PAP (zhang2019pattern) and PSD (zhou2020pattern) iteratively diffuse features based on a cross-task affinity matrix. MTI-Net (vandenhende2020mti) first considers task interactions at multiple scales using both Sigmoid function and squeeze-and-excitation block (hu2018squeeze).
Transformer-based works exploit long-range dependencies using self-attention mechanisms.
Feature Filtering. Multi-Task Guided Prediction-And-Distillation Network (PAD-Net) (xu2018pad) utilizes the predictions from hierarchical auxiliary tasks as multi-modal inputs to distill knowledge for the final tasks. As shown in Fig. 7(i), PAD-Net uses a hard parameter sharing-based encoder to extract common feature maps that can be used for different tasks, and then the decoder for each auxiliary task generates intermediate predictions for use in multi-modal distillation. The source paper proposes three distillation modules to incorporate useful multi-modal information for the final tasks. Suppose the feature maps from -th task at -th layer are denoted as , which are transformed from predictions of -th task via convolutional layers. The output feature maps for the usage of -th task after the multi-modal distillation are represented as .
The first way to perform cross-modal distillation is a naïve concatenation via , which is then fed into the separate decoders for each task. Differently, the second way refines feature via passing knowledge from other tasks as below:
(54) |
where denotes the weight tensor of convolutions that maps the -th task to the -th task. Furthermore, the third way utilizes the sigmoid function to filter the passing knowledge, which learns an attention map for the -th task as follows:
(55) |
Then the knowledge is filtered via this attention map as follows:
(56) |
After the multi-modal distillation, the distilled feature maps are up-sampled for the final pixel-level prediction tasks.
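The third, attention-guided way of passing knowledge can be sketched as below. The sketch makes simplifying assumptions that go beyond the text: convolutions are replaced by scalar weights, feature maps are flattened to lists, and the sigmoid gate is computed from the target task's own features; all names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def attention_guided_distill(feats, target, w_gate, w_msg):
    """Refine the target task's features with gated messages from other tasks.

    feats:  dict task -> flattened feature map (list of floats).
    w_gate: scalar standing in for the convolution that yields the gate.
    w_msg:  dict source task -> scalar standing in for the convolution
            that maps source features to the target task.
    """
    gate = [sigmoid(w_gate * f) for f in feats[target]]
    out = list(feats[target])
    for source, f_src in feats.items():
        if source == target:
            continue
        for i, f in enumerate(f_src):
            # The attention map filters how much of each message passes.
            out[i] += gate[i] * w_msg[source] * f
    return out


feats = {"depth": [1.0, 2.0], "normal": [4.0, 6.0]}

# With w_gate = 0 every gate value is 0.5, so half of each message passes.
refined = attention_guided_distill(feats, "depth", 0.0, {"normal": 1.0})
```

Dropping the gate (setting it to one everywhere) would recover the second, unfiltered message-passing variant described above.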
Multi-Task Attention Network (MTAN) (liu2019end) presents a novel MTL architecture based on task-specific feature-wise attention, while global features are shared across different tasks. Suppose the shared global features are denoted by at the -th layer, and the features learned from task are denoted by . Then the feature-wise attention on the global feature pool is computed as follows:
(57) |
where is then concatenated with the features from the global pool again and fed into the task-specific convolution blocks. The attention map is learned in an end-to-end fashion as a parameter-free activation function.
To make the learning process more balanced between different tasks, liu2019end also suggests a simple yet effective Dynamic Weight Average (DWA) strategy (See § 2.2.5) to adjust losses according to their magnitudes in different epochs.
Multi-Scale Task Interaction Networks (MTI-Net) (vandenhende2020mti) aggregates multi-modal features at different scales from the decoder. As shown in Fig. 7(k), features at each scale are transformed and distilled by the feature propagation module and multi-modal distillation, respectively. This allows the model to capture task interactions at multiple scales. As the higher resolution scales have a limited receptive field, low-quality task-related features are presented. Simple upsampling and passing of task-related features from lower scales to higher scales (ronneberger2015u) inspire the design of the Feature Propagation Module (FPM). In this manner, features from different tasks at each scale are harmonized via the traditional convolutions and activation functions. To obtain the task-attentive features, a Sigmoid function along the task dimension is inserted to generate a task attention mask. To remedy the negative transfer among unrelated tasks, a per-task channel gating mechanism (SE, i.e. Squeeze-And-Excitation module (hu2018squeeze)) is used to refine the shared representations.
Furthermore, suppose the feature maps for the task at scale are represented by ; then the per-scale multi-modal distillation process for task is repeated as follows:
(58) |
where the Sigmoid function produces a spatial-wise attention mask to filter the features at different scales. and denote the weights to map features before attention. The FPM and multi-scale multi-modal distillation result in distilled cross-task features at every scale, which are then fed into the final aggregation module. The predictions are based on decoding these final representations via a task-specific head for each task.
Feature Diffusion. Pattern-Affinitive Propagation (PAP) (zhang2019pattern) builds a cross-task affinity matrix based on a spatial-wise attention mechanism and then iteratively diffuses features on each of the tasks to refine affinitive patterns among tasks. The detailed architecture is shown in Fig. 7(l). Suppose the feature maps before the computing of task-specific affinity matrix are denoted by , the affinity matrix for each task is computed using the inner product between each pair of spatial-wise feature vector with the length of :
(59) |
where is used to preserve the channel dimension. If the affinity matrix of each task is weighted by a learnable parameter , then the final affinity matrix for the task can be adaptively combined as follows:
(60) |
which is an adaptive combination process that can propagate the cross-task affinitive patterns for the target -th task. Furthermore, the cross-task affinitive patterns are used to iteratively diffuse features for each task:
(61) |
where denotes the diffusion step. In general, the multi-step iterative diffusion process propagates the affinity information best. Suppose the maximum of step is , finally the feature maps in the next layer are computed as follows:
(62) |
where is a hyperparameter to control the feature consistency.
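The three steps of PAP (per-task affinity, adaptive combination, and iterative diffusion) can be sketched as follows. The sketch assumes non-negative features with non-zero rows, row normalization of the inner-product affinity, and a final blend of the form (1 - beta) * original + beta * diffused; these forms and the names `beta` and `weights` are assumptions made here for illustration.

```python
def affinity(feat):
    """Row-normalized inner-product affinity between spatial positions.

    feat: list of N position vectors, each of length C
          (assumed non-negative and not all zero).
    """
    raw = [[sum(a * b for a, b in zip(u, v)) for v in feat] for u in feat]
    return [[x / sum(row) for x in row] for row in raw]


def combine(affinities, weights):
    """Weighted sum of per-task affinity matrices (weights assumed learned)."""
    n = len(affinities[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, affinities)) for j in range(n)]
        for i in range(n)
    ]


def diffuse(feat, m, steps, beta):
    """Diffuse features with affinity m for several steps, then blend the
    result with the original features using beta."""
    h = feat
    n, c = len(feat), len(feat[0])
    for _ in range(steps):
        h = [[sum(m[i][j] * h[j][k] for j in range(n)) for k in range(c)]
             for i in range(n)]
    return [[(1.0 - beta) * f + beta * d for f, d in zip(fr, hr)]
            for fr, hr in zip(feat, h)]


feat_a = [[1.0, 0.0], [0.0, 1.0]]   # task A: two dissimilar positions
feat_b = [[1.0, 1.0], [1.0, 1.0]]   # task B: two identical positions
m_cross = combine([affinity(feat_a), affinity(feat_b)], [0.5, 0.5])
refined_a = diffuse(feat_a, m_cross, steps=1, beta=0.5)
```

Here task B's affinity tells task A that its two positions are related, so diffusion moves part of each position's feature to the other one.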
Pattern-Structure Diffusion (PSD) (zhou2020pattern) utilizes a shared CNN encoder to extract feature maps that are fed into the task-specific decoders, where pattern structures are distilled both within each task and across tasks. As shown in Fig. 7(m), the intra-task PSD transmits pattern structures within each task to enhance the task-specific patterns and then connects with the inter-task PSD to correlate the pattern structures across different tasks. Without loss of generality, we assume a patch cropped at each position of feature maps as , where means the pattern at position . Then the pattern structure can be defined from the KNN graph on points within as follows:
(63) |
where is a fixed hyper-parameter set by user. To make pattern structure at different scale comparable, is further normalized as follows:
(64) |
Then the intra-task PSD can be formulated as a recursive process:
(65) |
where denotes the pattern structure of the whole feature map, is the neighbor set of the target pixel , and is a fixed hyper-parameter to control the residual connection. The iteration above contains multiple steps to guarantee that each local pattern is spread into distant regions, which is a diffused process.
To achieve cross-task pattern-structure propagation, inter-task PSD transfers the patterns from other tasks as follows:
(66) | s.t. |
where represent the transferred pattern-structures from task to the target task . In this manner, the PSD method distills feature similarity across different tasks.
Soft Attention. Attentive Single-Tasking of Multiple Tasks (ASTMT) (maninis2019attentive) highlights a dilemma: information that is critical to one task can be a nuisance to another when multiple tasks are inferred together. ASTMT addresses it by single-tasking, a strategy that executes one task at a time instead of inferring all of them simultaneously. Technically, all tasks share a backbone network in a hard manner, and each task adapts the backbone to its own needs through residual adapter (RA) branches, as shown in Fig. 7(n). Suppose the RA operation is represented by for the -th task, and its original residual skip connection is . Then the single-tasking process by RA is calculated as below:
(67) |
where denotes the residual connection that is not influenced by the task. can be naïve bottleneck convolutions or transformed to an attentive block (e.g. SE-ResNet block (hu2018squeeze)). In order to address the limitation of this adaptation failing to disentangle the shared and task-specific space, a GRadiEnt Adversarial Training (GREAT) process (sinha2018gradient) is introduced for different tasks to ensure that the shared backbone learns the shared representations and maintains this quality during the single-tasking process. More details of multi-task adversarial training are shown in § 2.2.7.
The Adaptive Task-Relational Context (ATRC) module (bruggemann2021exploring) uses global cross-task and local spatial-wise attention mechanisms to refine each task prediction. It is a general module that can be applied to any backbone and any supervised dense prediction task. The ATRC refinement begins with a hard-parameter-sharing encoder, on top of which each task head generates task-specific features and auxiliary predictions , where . Specifically, the features of each target task are refined by attending to the features of every available task within a separate Context Pooling (CP) block. As shown in Fig. 7(o), the original features and refined features are combined to predict the target task .
There are three categories of context information (global context, local context, and label context) to be learned when refining features from the source task to the target task. A detailed illustration is given in Fig. 12. Each CP block accepts the features and predictions from the source task and target task, respectively. and are transformed into queries , keys and values (flattening along the spatial dimension and preserving channel dimension) as below:
(68) |
where is a CONV-BN-ReLU operation, and . In the attention of global context, a target feature value at position is substituted with
(69) |
where denotes the number of total pixels (i.e. feature values) and sim denotes an arbitrary similarity function. For the local context attention, let us denote by the 2D spatial neighborhood of target pixel at position with the patch extent , then the spatial-wise local attention is formulated as below:
(70) |
where is the channel dimension of . The $T$-label context and the $S$-label context are defined in the label space, which is partitioned into a set of disjoint label regions. The aim is to find a prototypical representation for each pixel. Suppose , where each entry of the last dimension indicates the degree to which a pixel belongs to a label region . For the $T$-label context, the keys and values are calculated via the region prototypes as below:
(71) |
where denotes the softmax normalization over the spatial dimension, and the matrix represents the region prototypes. Alternatively, is substituted with the source task prediction maps in the $S$-label context. The outputs of both are attention-weighted combinations of features :
(72) |
Deformable Mixer Transformers (DeMT) (zhang2023demt) is an encoder-decoder architecture that combines the merits of deformable CNNs (dai2017deformable; zhu2019deformable) and attention-based ViT (dosovitskiy2021an) to model multiple tasks; the details are shown in Fig. 7(p). The encoder, called the deformable mixer in zhang2023demt, mixes features across channels through convolutions and captures deformable spatial features through learnable offsets. After the encoder has learned task-specific features, the task-aware transformer decoder first applies task interactions based on the attention mechanism (MHSA + MLP) and then constructs the task query block to decode task-aware features for each task. Suppose the transformer operator inside the task interaction block can be abstracted as
(73) |
where denotes the layer norm on fused feature , and the subscripts and denote the feature index before and after the task interaction block, respectively. To decode task awareness in the task query block, another transformer involves task-specific query before (i.e., ):
(74) |
where the subscript denotes the feature index after the task query block.
2.2.5. Scalarization Approach.
One of the most popular ways to solve multi-task learning problems is the scalarization approach, which formulates the problem as a linear combination of the loss functions of the different tasks (kendall2018multi; liu2019end; chen2018gradnorm; Senushkin_2023_CVPR) as
$$\min_{\theta} \; \sum_{i=1}^{T} w_i \mathcal{L}_i(\theta), \tag{75}$$
where $\{w_i\}_{i=1}^{T}$ are the tasks' weights and are used to encode preferences over the $T$ different tasks, $\theta$ is the model parameter, and $\{\mathcal{L}_i\}_{i=1}^{T}$ are the loss functions of the different tasks. In each loss function $\mathcal{L}_i$, we drop the dependency on training samples to avoid cluttered notation.
Gradient-based methods are perhaps the most popular choice for solving Eq. (75). Their update rule for the model parameter $\theta$ takes the form $\theta \leftarrow \theta + \eta d$, where $\eta$ is the learning rate and $d$ is the search direction. $d$ is a function of the task gradients $\nabla_\theta \mathcal{L}_i(\theta)$, for example, $d = -\sum_{i} w_i \nabla_\theta \mathcal{L}_i(\theta)$ with task weights $w_i$. Aside from the challenge of choosing a proper learning rate $\eta$, there are two additional challenges, dominant gradients and conflicting gradients; see Fig. 13 for an illustration. The dominant gradient issue occurs when the gradient norms of some tasks' losses are significantly larger than those of the others, so that the update direction is biased towards the tasks with larger gradient norms. The conflicting gradient issue arises when making progress on one task degrades the performance of another task.
In the remainder of this section, we review some works with different philosophies for addressing the challenges of dominant and conflicting gradients. These methods can be roughly characterized as gradient correction approaches, where transformations are applied to the gradients to address the conflicting gradients issue, and dynamic weighting approaches, where the task weights $w_i$ are updated in each iteration to address the dominant gradients issue.
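Both issues can be checked numerically from the per-task gradients. The sketch below is only an illustration: the function name `gradient_diagnostics` and the use of the max/min norm ratio as a dominance indicator are assumptions, not quantities defined in the works reviewed here.

```python
import numpy as np

def gradient_diagnostics(grads):
    """grads: (T, n) array with one flattened task gradient per row."""
    grads = np.asarray(grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1)
    unit = grads / np.maximum(norms[:, None], 1e-12)
    cosine = unit @ unit.T                               # pairwise cosine similarities
    dominance = norms.max() / max(norms.min(), 1e-12)    # large ratio: dominant gradient
    rows, cols = np.triu_indices(len(grads), k=1)
    conflicting = [(int(i), int(j)) for i, j in zip(rows, cols) if cosine[i, j] < 0]
    return dominance, conflicting

g = np.array([[10.0, 0.0], [-0.1, 0.1], [0.0, 1.0]])
dominance, conflicting = gradient_diagnostics(g)
```

In this toy example the first task has by far the largest gradient norm, and only the first and second tasks have a negative inner product.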
Gradient Correction. Projecting Conflicting Gradients (PCGrad) (yu2020gradient) mitigates the conflicting gradients issue by projecting each conflicting gradient onto the normal plane of the gradient it conflicts with. Formally, let $g_i = \nabla_\theta \mathcal{L}_i(\theta)$ denote the gradient of the $i$th of $T$ tasks. PCGrad (yu2020gradient) defines two gradients $g_i$ and $g_j$ to be conflicting if $g_i^\top g_j < 0$. To address this issue, instead of forming the search direction as $d = -\sum_{i} g_i$, PCGrad suggests using $d = -\sum_{i} g_i^{\mathrm{PC}}$, where $g_i^{\mathrm{PC}}$ is initialized as $g_i$ and, for every $j \neq i$ with $(g_i^{\mathrm{PC}})^\top g_j < 0$, is replaced by $\mathrm{Proj}_{g_j^{\perp}}(g_i^{\mathrm{PC}}) = g_i^{\mathrm{PC}} - \frac{(g_i^{\mathrm{PC}})^\top g_j}{\|g_j\|^2} g_j$, and Proj is the Euclidean projection operator. See Fig. 14 for an illustration. From the perspective of multi-objective optimization (which will be discussed in the next section), this method is a particular way of choosing a common descent direction. Gradient Sign Dropout (GradDrop) (chen2020just) attributes conflicts to differences in the signs of gradients along each coordinate direction. Motivated by dropout, it proposes a probabilistic masking procedure that keeps only gradients with consistent signs in each update. Conflict-Averse Gradient descent (CAGrad) (liu2021conflictaverse) mitigates gradient conflicts by solving the problem
$$\max_{d} \; \min_{i} \; -g_i^\top d \quad \text{s.t.} \quad \|d + g_0\| \le c \, \|g_0\|, \tag{76}$$
where $g_0 = \frac{1}{T}\sum_{i} g_i$ is the gradient of the average loss and $c \in [0, 1)$ is a prescribed parameter. The intuition is that $\min_i -g_i^\top d$ can be used as an approximate evaluation of the conflict among objectives, and one wants to find the direction that minimizes such a conflict while staying close to the original negative gradient $-g_0$ of the average loss.
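As a concrete illustration of the projection step used by PCGrad, the following minimal NumPy sketch operates on flattened per-task gradients. The function name `pcgrad` and the fixed random seed are assumptions for this example; the returned vector is the sum of the projected gradients, so the search direction is its negative.

```python
import numpy as np

def pcgrad(grads, rng=None):
    """Return the sum of projected task gradients for a (T, n) array of gradients."""
    rng = np.random.default_rng(0) if rng is None else rng
    grads = np.asarray(grads, dtype=float)
    projected = grads.copy()
    num_tasks = grads.shape[0]
    for i in range(num_tasks):
        others = [j for j in range(num_tasks) if j != i]
        for j in rng.permutation(others):            # visit the other tasks in random order
            dot = projected[i] @ grads[j]
            if dot < 0:                               # conflict: remove the component along g_j
                projected[i] -= dot / (grads[j] @ grads[j]) * grads[j]
    return projected.sum(axis=0)

g = np.array([[1.0, 0.0], [-1.0, 1.0]])               # g_1 and g_2 conflict: g_1 . g_2 = -1
merged = pcgrad(g)
```

For these two gradients, the projected versions are $(0.5, 0.5)$ and $(0, 1)$, so the merged gradient has a non-negative inner product with both original task gradients.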
Reducing Conflicting Gradients (Recon) (shi2023recon) empirically observes that PCGrad, GradDrop, and CAGrad (yu2020gradient; chen2020just; liu2021conflictaverse) only slightly reduce the occurrence of conflicting gradients in some cases compared to joint-training (i.e., the case where $w_i = 1$ for all $i$ in Eq. (75)), and even increase the occurrence in other cases. Therefore, Recon analyzes parameters in a layer-wise fashion to pinpoint the shared parameters that are most likely to incur conflicting gradients. Concretely, let $g_i^{(k)}$ and $g_j^{(k)}$ be the gradients of the task pair $(i, j)$ with respect to the $k$th layer's parameters. The pair is said to be $S$-conflicting if $\cos(g_i^{(k)}, g_j^{(k)}) < S$ for any $-1 < S \le 0$. Recon first trains the model with any gradient-based method, e.g., PCGrad, GradDrop, or CAGrad, for a number of epochs and accumulates the conflict scores of each layer over these epochs to identify the top $K$ layers with the highest conflict scores. Finally, Recon turns these layers' parameters into task-specific parameters and retrains the network from scratch. As pointed out in shi2023recon, while Recon is sensitive to the parameters $S$ and $K$, one only needs to tune them once for a given network architecture.
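The layer-wise bookkeeping behind this idea can be sketched as follows. This is a simplified illustration rather than the Recon implementation: the function names, the default threshold `S=-0.1`, and the choice to count conflicting task pairs per layer are assumptions.

```python
import numpy as np
from itertools import combinations

def layer_conflict_scores(layer_grads, S=-0.1):
    """layer_grads[k] is a list of flattened per-task gradients for layer k."""
    scores = []
    for grads in layer_grads:
        count = 0
        for gi, gj in combinations(grads, 2):
            cos = gi @ gj / (np.linalg.norm(gi) * np.linalg.norm(gj) + 1e-12)
            if cos < S:                               # this task pair is S-conflicting
                count += 1
        scores.append(count)
    return scores

def top_conflicting_layers(score_history, K):
    """Sum per-layer scores over training steps and return the K worst layer indices."""
    total = np.sum(score_history, axis=0)
    return sorted(np.argsort(-total, kind="stable")[:K].tolist())

layers = [
    [np.array([1.0, 0.0]), np.array([-1.0, 0.0])],    # layer 0: opposite gradients
    [np.array([1.0, 0.0]), np.array([1.0, 1.0])],     # layer 1: aligned gradients
]
scores = layer_conflict_scores(layers)
selected = top_conflicting_layers([scores, scores], K=1)
```

The selected layers are the candidates to be turned into task-specific parameters before retraining.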
Dynamic Weighting. GradNorm (chen2018gradnorm) mitigates the dominant gradient issue by ensuring that the gradient of each task has a proper magnitude. The strategy for adjusting the task weights $w$ is based on the average gradient norm across tasks and the relative progress achieved on each task. With this information, GradNorm constructs a reference point at each iteration, and $w$ is then selected to minimize the distance between the actual gradient norm of each task and the reference point. Concretely, let $G_i(t) = \|\nabla_\theta \, w_i(t) \mathcal{L}_i(\theta(t))\|_2$ be the norm of the $i$th task's weighted gradient at iteration $t$ (we add the index $t$ to indicate the dependence on the iteration counter). Next, the average gradient norm across all $T$ tasks is calculated as $\bar{G}(t) = \frac{1}{T}\sum_{i} G_i(t)$. To measure the training progress of each task, the loss ratio $\tilde{\mathcal{L}}_i(t) = \mathcal{L}_i(\theta(t)) / \mathcal{L}_i(\theta(0))$ is introduced, which is inversely proportional to the training rate. Lastly, the relative inverse training rate for task $i$ is formulated as $r_i(t) = \tilde{\mathcal{L}}_i(t) / \big(\frac{1}{T}\sum_{j} \tilde{\mathcal{L}}_j(t)\big)$. A higher value of $r_i(t)$ leads to a higher target gradient magnitude for task $i$ at iteration $t$, which encourages task $i$ to learn more quickly. Finally, the weight $w(t)$ is determined by solving the following problem
$$\min_{w(t)} \; \sum_{i=1}^{T} \left| G_i(t) - \bar{G}(t) \, [r_i(t)]^{\alpha} \right|, \tag{77}$$
where $\alpha$ is introduced to avoid dramatically different learning dynamics between tasks caused by differing task complexity. Inspired by GradNorm, Dynamic Weight Averaging (DWA) (liu2019end) is another strategy for balancing the task-specific losses. The weights are updated as $w_i(t) = \frac{T \exp(r_i(t-1)/\tau)}{\sum_{j} \exp(r_j(t-1)/\tau)}$, where $r_i(t-1) = \mathcal{L}_i(t-1)/\mathcal{L}_i(t-2)$ is the relative progress for task $i$ at iteration $t-1$ and $\tau$ is a temperature that controls the softness of the weighting. Reinforced MTL (RMTL) (liu2018exploration, Chapter 3) adjusts $w$ using a reinforcement learning strategy, and Loss-Balanced Task Weighting (LBTW) (liu2019loss) combines GradNorm and RMTL so that the weights are adapted to both samples and tasks. Impartial MTL (IMTL) (liu2021towards) updates $w$ in each iteration such that the aggregated gradient has equal projections onto the raw gradients of the individual tasks. It achieves this goal by solving a linear system with respect to $w$.
Before solving for $w$, IMTL also proposes a heuristic to scale the losses so that all of them are on similar scales, which is essentially another scaling of $w$. Achievement-based MTL (yun2023achievement) defines the weight of each task by measuring its training progress, comparing the current training accuracy of the task (trained in the multi-task setting) with its maximum training accuracy (trained in the single-task setting). In addition, Achievement-based MTL uses the geometric mean instead of the arithmetic mean of the task losses to define the loss function.
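Among the schemes above, the DWA update is simple enough to sketch in a few lines. The sketch follows the softmax-over-loss-ratios form from the DWA paper; the function name `dwa_weights` and the default temperature of 2.0 are assumptions for illustration.

```python
import numpy as np

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Task weights from the ratio of the two most recent losses of each task."""
    ratio = np.asarray(prev_losses, dtype=float) / np.asarray(prev_prev_losses, dtype=float)
    scaled = ratio / temperature
    exp = np.exp(scaled - scaled.max())               # numerically stable softmax
    return len(ratio) * exp / exp.sum()               # weights sum to the number of tasks

# Task 0 improved slowly (1.0 -> 0.9), task 1 improved quickly (1.0 -> 0.5).
weights = dwa_weights(prev_losses=[0.9, 0.5], prev_prev_losses=[1.0, 1.0])
```

The slowly improving task receives the larger weight, and a larger temperature pushes the weights towards uniform.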
Uncertainty Weighting (kendall2018multi) takes a different perspective from the above dynamic weighting approaches. This work assumes that there are underlying distributions for the labels of the different tasks and that the tasks are independent. The final loss function, derived from the likelihood perspective, takes the same form as Eq. (75), with each task weight $w_i$ specified as the reciprocal of the variance of the distribution used to model the corresponding task. Instead of optimizing over the model parameter $\theta$ alone, kendall2018multi optimizes $\theta$ and $w$ simultaneously
(78) |
At this point, one can observe that all aforementioned works under the dynamic weighting category, excluding kendall2018multi, do not necessarily respect the optimization problem formulated in Eq. (75), even though they empirically work well in producing useful solutions. Nonetheless, one can also regard the dynamic weighting approach as either solving Eq. (78) using different rule-based strategies to update the task weights $w$ or using gradient-based methods to inexactly solve a sequence of problems in the form of Eq. (75).
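To make the joint optimization over weights concrete, the sketch below uses the commonly cited regression form of uncertainty weighting, $\sum_i \frac{1}{2} e^{-s_i} \mathcal{L}_i + \frac{1}{2} s_i$ with $s_i = \log \sigma_i^2$. This particular form, the function name, and the plain gradient-descent loop are assumptions for illustration and are not claimed to match Eq. (78) term by term.

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_vars):
    """Return the total loss, the implied task weights, and the gradient w.r.t. log_vars."""
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    weights = 0.5 * np.exp(-log_vars)                 # w_i = 1 / (2 * sigma_i^2)
    total = np.sum(weights * task_losses + 0.5 * log_vars)
    grad_log_vars = -weights * task_losses + 0.5
    return total, weights, grad_log_vars

losses = np.array([4.0, 0.25])                        # task losses held fixed for this demo
s = np.zeros(2)                                       # s_i = log(sigma_i^2)
for _ in range(2000):
    _, w, grad = uncertainty_weighted_loss(losses, s)
    s -= 0.1 * grad                                   # gradient step on the log-variances
```

With the task losses held fixed, the log-variances converge to $\sigma_i^2 = \mathcal{L}_i$, so a task with a larger loss ends up with a smaller weight; the $\frac{1}{2} s_i$ term prevents the trivial solution of driving all weights to zero.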
To conclude this section, we point out that some works try to address both issues simultaneously (javaloy2022rotograd; Senushkin_2023_CVPR). For example, Alignment for MTL (Aligned-MTL) (Senushkin_2023_CVPR) considers the condition number of the linear system $d = -Gw$ as a measure of the severity of both gradient dominance and gradient conflict, where $G = [g_1, \dots, g_T]$ collects the task gradients $g_i = \nabla_\theta \mathcal{L}_i(\theta)$ and $w$ is the vector of task weights. Therefore, the authors propose to find a well-conditioned $\hat{G}$ to approximate $G$ and thereby obtain a refined update direction $\hat{d} = -\hat{G}w$. Concretely, the authors propose to solve $\min_{\hat{G}} \|G - \hat{G}\|_F^2$ subject to $\hat{G}^\top \hat{G} = I$ by singular value decomposition (SVD) and to use the refined direction $\hat{d}$ instead of $d$. The convergence rate of the proposed algorithm is established under the assumption that all loss functions are Lipschitz smooth and bounded below. Although the numerical results are promising, one should be aware of the computational cost of the SVD despite the existence of efficient algorithms (bondhugula2006fast).
2.2.6. Multi-objective Optimization (MOO).
In contrast to the scalarization approach, which converts different objective functions into one aggregated objective function and then optimizes it, MOO aims to optimize several (potentially conflicting) objective functions simultaneously. Concretely, MOO aims to solve the following problem
$$\min_{\theta \in \Omega} \; \mathcal{L}(\theta) := \big(\mathcal{L}_1(\theta), \dots, \mathcal{L}_T(\theta)\big)^\top, \tag{79}$$
where $\Omega$ is the feasible domain for $\theta$ (examples will be given shortly). For a comprehensive background on the MOO topic, we refer readers to ehrgott2005multicriteria; for readers who prefer a quick overview of this subject, we recommend liu2020review. Below, we provide only the minimum background required to make the exposition accessible to readers with a background in single-objective optimization.
We begin with a few concepts that help readers understand the type of solutions that MOO algorithms can normally obtain.
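Definition 4.
(1) $\theta^* \in \Omega$ is called a weak Pareto minimizer of $\mathcal{L}$ over $\Omega$ if there is no $\theta \in \Omega$ such that $\mathcal{L}(\theta) < \mathcal{L}(\theta^*)$. Here, $<$ is the element-wise comparison. The set $\{\mathcal{L}(\theta^*) : \theta^* \text{ is a weak Pareto minimizer}\}$ is called the Pareto front.
(2) $\theta^* \in \Omega$ is called a strict Pareto minimizer of $\mathcal{L}$ over $\Omega$ if there is no $\theta \in \Omega$ such that $\mathcal{L}(\theta) \le \mathcal{L}(\theta^*)$ and $\mathcal{L}(\theta) \neq \mathcal{L}(\theta^*)$. Here, $\le$ is the element-wise comparison.
(3) $\theta^* \in \Omega$ is called a Pareto stationary point of $\mathcal{L}$ over $\Omega$ if $\max_{i} \nabla \mathcal{L}_i(\theta^*)^\top (\theta - \theta^*) \ge 0$ for all $\theta \in \Omega$. Intuitively, this definition implies that along every feasible direction there is at least one objective function $\mathcal{L}_i$ that cannot be decreased further to first order.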
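For a finite set of candidate solutions, the weak and strict notions in Definition 4 can be checked directly on the loss vectors. The sketch below is illustrative only; the function name `pareto_minimizers` and the toy loss vectors are assumptions.

```python
import numpy as np

def pareto_minimizers(loss_vectors, weak=True):
    """Indices of weak (or strict) Pareto minimizers among a finite set of loss vectors."""
    losses = np.asarray(loss_vectors, dtype=float)
    keep = []
    for i, li in enumerate(losses):
        others = [lj for j, lj in enumerate(losses) if j != i]
        if weak:
            # Excluded only if some other point is strictly better in every objective.
            beaten = any(np.all(lj < li) for lj in others)
        else:
            # Excluded if some other point is no worse everywhere and differs somewhere.
            beaten = any(np.all(lj <= li) and np.any(lj < li) for lj in others)
        if not beaten:
            keep.append(i)
    return keep

candidates = [[1, 3], [2, 2], [3, 1], [3, 3], [1, 4]]
weak_set = pareto_minimizers(candidates, weak=True)
strict_set = pareto_minimizers(candidates, weak=False)
```

The point $(1, 4)$ is a weak but not a strict Pareto minimizer: no candidate improves both objectives at once, yet $(1, 3)$ matches it on the first objective and improves the second.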
We give a graphical illustration of all these Pareto-related points in Fig. 15. In Fig. 15, the $\theta$s that correspond to circles and crosses are Pareto stationary points. However, when the $\mathcal{L}_i$ are not convex, Pareto stationary points can generate objective vectors $\mathcal{L}(\theta)$ that do not lie on the Pareto front. An analogy in single-objective optimization is that a stationary point of a nonconvex objective function may not be the global minimum. Due to the nonconvex nature of neural networks, most, if not all, of the algorithms considered here (when a convergence analysis is provided) can only guarantee finding a Pareto stationary point rather than a weak/strict Pareto minimizer. However, under additional assumptions such as (strong) convexity, one can obtain solutions whose objective values are on the Pareto front. In the sequel, we review some works with different strategies for generating a (set of) Pareto stationary point(s).
The first line of works, e.g., sener2018multi; lin2019pareto; navon2022multi, builds upon and extends the seminal Multiple-Gradient Descent Algorithm (MGDA) (fliege2000steepest) to the neural network setting. The essence of MGDA is to find, at each iteration, a common descent direction that decreases all objective functions simultaneously. If no such direction exists, the algorithm terminates and returns a (set of) Pareto stationary point(s). MGDA constructs the common descent direction by solving the following optimization problem (for simplicity, we consider only the unconstrained case $\Omega = \mathbb{R}^n$ for now; we will discuss the constrained case shortly):
$$\min_{d \in \mathbb{R}^n} \; \max_{i} \; \nabla \mathcal{L}_i(\theta)^\top d + \frac{1}{2}\|d\|^2. \tag{80}$$
In problem (80), if we drop the second-order term $\frac{1}{2}\|d\|^2$, the problem intuitively tries to find the search direction $d$ that maximizes the minimal progress that can be made across tasks, where the progress on task $i$ is measured by the difference between $\mathcal{L}_i(\theta)$ and the first-order Taylor approximation of $\mathcal{L}_i(\theta + d)$ around $\theta$. The second-order term is added to guarantee the uniqueness of the solution of problem (80). The solution is known as the steepest common descent direction in the optimization literature. In deep neural network applications, however, $n$ can be at the billion scale, so solving problem (80) directly is very challenging. Instead of solving (80), MGDA-MTL (sener2018multi) solves the dual problem
$$\min_{\lambda \in \mathbb{R}^T} \; \frac{1}{2}\Big\|\sum_{i=1}^{T} \lambda_i \nabla \mathcal{L}_i(\theta)\Big\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{T} \lambda_i = 1, \;\; \lambda_i \ge 0 \text{ for all } i, \tag{81}$$
where $\lambda_i$ is the $i$-th element of the vector $\lambda$. One can see that the dual problem's dimension reduces to the number of tasks $T$, which is usually several orders of magnitude smaller than the number of model parameters $n$, so the dual can be solved efficiently, e.g., by the Frank-Wolfe algorithm (jaggi2013revisiting) as used in sener2018multi. The solution $d^*$ to problem (80) can be recovered from the solution $\lambda^*$ to problem (81) as $d^* = -\sum_{i} \lambda_i^* \nabla \mathcal{L}_i(\theta)$, and the model parameter is updated as $\theta \leftarrow \theta + \eta d^*$ with step size $\eta > 0$. Under proper assumptions, the iterates or a subsequence of the iterates converge to a Pareto stationary point. If all $\mathcal{L}_i$ are convex, then the limit point is not only a Pareto stationary point but also a weak Pareto minimizer, meaning that its corresponding function value vector is on the Pareto front. MGDA-MTL further develops an efficient variant of MGDA for the case where the neural network's parameters can be decoupled into shared and task-specific parts, so that the common descent direction only needs to be found with respect to the shared part. Another work, Nash-MTL (navon2022multi), formulates the problem of finding the common descent direction as a bargaining game. Concretely, the common descent direction is obtained as $d = -G\alpha$, where $G = [g_1, \dots, g_T]$ with $g_i = \nabla \mathcal{L}_i(\theta)$ and $\alpha \in \mathbb{R}_{+}^{T}$ is a solution to the system $G^\top G \alpha = 1/\alpha$ (here $1/\alpha$ is the element-wise reciprocal).
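The min-norm dual problem of MGDA is small enough to solve with a few Frank-Wolfe iterations on the Gram matrix of the task gradients. The sketch below is a generic Frank-Wolfe solver with exact line search, written for illustration; the function name `min_norm_weights` and the stopping tolerances are assumptions, and it is not the solver shipped with sener2018multi.

```python
import numpy as np

def min_norm_weights(grads, iters=100):
    """Frank-Wolfe for min over the simplex of 0.5 * ||sum_i lam_i g_i||^2."""
    G = np.asarray(grads, dtype=float)                # (T, n), one task gradient per row
    M = G @ G.T                                       # Gram matrix of the task gradients
    num_tasks = len(G)
    lam = np.full(num_tasks, 1.0 / num_tasks)
    for _ in range(iters):
        vertex = np.zeros(num_tasks)
        vertex[int(np.argmin(M @ lam))] = 1.0         # best simplex vertex for the linearization
        diff = vertex - lam
        curvature = diff @ M @ diff
        if curvature <= 1e-12:
            break
        step = np.clip(-(lam @ M @ diff) / curvature, 0.0, 1.0)   # exact line search
        if step == 0.0:
            break
        lam = lam + step * diff
    return lam

g = np.array([[1.0, 0.0], [-1.0, 1.0]])
lam = min_norm_weights(g)
direction = -(lam @ g)                                # common descent direction d = -sum_i lam_i g_i
```

For these two gradients the minimizer is $\lambda^* = (0.6, 0.4)$, and the resulting direction has a negative inner product with both task gradients, so a small step along it decreases both losses to first order.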
One potential issue with MGDA-MTL and Nash-MTL, and more generally with MGDA-type methods, is that they can only produce one Pareto stationary point instead of a set of Pareto stationary points. Producing a set of solutions has the advantage of allowing practitioners to choose the solution that best fits their needs. To address this issue, Pareto-MTL (lin2019pareto) restricts the solution $\theta^*$ produced by one run of MGDA to a certain domain such that $\mathcal{L}(\theta^*)$ lies on a restricted region of the Pareto front (this is realizable only if the solution is a weak Pareto minimizer). By carefully crafting the regions, the algorithm can generate well-separated solutions on the Pareto front. Specifically, assuming $\mathcal{L}_i(\theta) \ge 0$ for all $i$ and $\theta$ and that a set of preference vectors $\{u_1, \dots, u_K\}$ is given, Pareto-MTL solves $K$ problems in parallel, where the $k$th problem is
$$\min_{\theta} \; \mathcal{L}(\theta) \quad \text{s.t.} \quad (u_j - u_k)^\top \mathcal{L}(\theta) \le 0, \;\; j = 1, \dots, K, \tag{82}$$
where $u_j \in \mathbb{R}_{+}^{T}$ for all $j$. Intuitively, the constraints in Eq. (82) force the solution to stay close to $u_k$ in the angular space. Problem (82) is more challenging than problem (79) since it has nonlinear inequality constraints. Consequently, problem (80) is changed to account for these additional constraints. For more details, we refer readers to Eq. (14) in lin2019pareto. However, as pointed out in Exact Pareto Optimal Search (EPO search) (mahapatra2020multi), Pareto-MTL does not guarantee that the solution matches the exact preference, and the number of preference vectors $K$ needs to grow exponentially fast as the number of tasks $T$ increases. Therefore, EPO search redesigns the constraints and develops a new algorithm to search for the exact solution that matches the preference. Formally, EPO search proposes to solve
(83) |
where $r$ is the user-specified preference vector, $[\cdot]_i$ takes the $i$th element, and $\mathcal{L}_i$ is non-negative for all $i$. Geometrically, this constraint forces the solution to lie where the ray determined by the preference vector intersects the Pareto front. Given an iterate, EPO search forms a search direction that tries to balance reducing the constraint violation (so that the new iterate can "better" satisfy the constraint) and decreasing all objective functions. Formally, the paper measures the constraint violation with a non-uniformity measure, which is zero if and only if the iterate satisfies the constraints. EPO search shows that taking a step along a suitable direction can reduce the non-uniformity (constraint violation). Meanwhile, the common descent direction that reduces all objective functions, if any exists, is a combination of the task gradients with suitably constrained coefficients. EPO search then designs a linear programming problem to find a search direction that balances reducing the constraint violation and reducing the loss functions. For more details, please refer to mahapatra2020multi. Built upon EPO search, Pareto HyperNetworks (PHN) (navon2021learning) uses a hypernetwork, which takes the preference vector as input and outputs the weights of the multi-task network, to attempt to learn the whole Pareto front. Although the training is more challenging, if the hypernetwork can be properly trained, then at inference time the user can supply any preference vector, and the hypernetwork outputs a Pareto stationary solution that closely aligns with the preference vector without requiring any additional effort.
All aforementioned algorithms, regardless of their actual implementation, assume access to the true gradients $\nabla \mathcal{L}_i(\theta)$. This assumption might fail in deep neural network settings, where only stochastic gradients are available. MoCo (Multi-objective gradient Correction) (fernando2023mitigating) is proposed to address this issue. It extends MGDA to the stochastic setting and provides convergence rates for both convex and non-convex cases. The most notable challenge in extending MGDA to the stochastic setting lies in the noise of the stochastic estimators of the true gradients. The standard way to address this issue is variance reduction. Unlike the seminal work of liu2021stochastic, which achieves variance reduction by increasing batch sizes, MoCo (fernando2023mitigating) reduces the variance with a momentum-based method, which has the advantage of keeping the batch size as small as one while still guaranteeing convergence (under proper assumptions). Concretely, at the $t$th iteration, instead of solving problem (81), MoCo solves
(84) |
where , where projects a vector onto a ball centered at the origin with radius , is the Lipschitz constant of , is some positive constant, and is some approximation of . One can show that as , hence achieving variance reduction.
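The effect of a momentum-style gradient tracker can be seen in a toy experiment. The sketch below assumes a tracking update of the form "move the tracked gradient a fraction `beta` towards the new stochastic gradient, then project onto a ball", which is a simplification chosen for illustration; the function names, the fixed true gradient, and all constants are assumptions.

```python
import numpy as np

def project_to_ball(v, radius):
    """Euclidean projection onto the ball of the given radius centered at the origin."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def track_gradient(y, stochastic_grad, beta, radius):
    """One momentum-style tracking step followed by a projection."""
    return project_to_ball(y - beta * (y - stochastic_grad), radius)

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0])                     # held fixed for this demo
y = np.zeros(2)
raw_err, tracked_err = [], []
for step in range(5000):
    noisy = true_grad + rng.normal(scale=1.0, size=2)  # batch-size-one style noise
    y = track_gradient(y, noisy, beta=0.05, radius=10.0)
    if step >= 1000:                                   # skip the warm-up phase
        raw_err.append(np.sum((noisy - true_grad) ** 2))
        tracked_err.append(np.sum((y - true_grad) ** 2))
```

After the warm-up phase, the tracked estimate stays much closer to the true gradient than the individual stochastic gradients do, without enlarging the batch.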
To conclude this section, Table 7 summarizes, to the best of our knowledge, all existing optimization methods covered in § 2.2.5 and § 2.2.6.
Algorithm | Venue | Year | Method | Convergence | Highlight | Availability1
Uncertainty Weighting | CVPR | kendall2018multi | Dynamic Weighting | — | Optimize the model parameters and the task weights simultaneously. | Official
GradNorm | ICML | chen2018gradnorm | Dynamic Weighting | — |  | Unofficial
MGDA-MTL | NeurIPS | sener2018multi | Multi-Objective Opt. | Asymptotic Convergence |  | Official
RMTL | Thesis | liu2018exploration | Dynamic Weighting | — | Adjust the task weights using the reinforcement learning strategy. | Official
LBTW | AAAI | liu2019loss | Dynamic Weighting | — | The task weights are adapted to both samples and tasks. | Official
DWA | CVPR | liu2019end | Dynamic Weighting | — | Adjust the task weights based on the relative progress achieved for each task. | Official
MLDT | CVPR | zheng2019pyramidal | Dynamic Weighting | — | The task weights are adapted to the likelihood of a loss reduction. | Official
Pareto MTL | NeurIPS | lin2019pareto | Multi-Objective Opt. | Asymptotic Convergence | Attempt to incorporate the user's preference into the solution. | Official
Controllable Pareto MTL | arXiv | lin2020controllable | Multi-Objective Opt. | — | Use a hypernetwork to learn the entire Pareto front. | Official
PCGrad | NeurIPS | yu2020gradient | Gradient Correction | — | Project onto the orthogonal subspace to mitigate gradient conflicts. | Official
GradDrop | NeurIPS | chen2020just | Gradient Correction | — | Keep only gradients that are consistent in sign in each update. | Official
Continuous Pareto MTL | ICML | ma2020efficient | Multi-Objective Opt. | — | Construct a continuous, first-order approximation of the local Pareto set. | Official
EPO Search | ICML | mahapatra2020multi | Multi-Objective Opt. | — |  | Official
AuxiLearn | ICLR | navon2021auxiliary | Bi-level Opt. | — | Learn to combine losses in a nonlinear fashion. | Official
IMTL | ICLR | liu2021towards | Gradient Correction |  |  | Unofficial
GradVac | ICLR | wang2021gradient | Dynamic Weighting | — | Encourage more geometrically aligned parameter updates for close tasks. | Unofficial
PHN | ICLR | navon2021learning | Multi-Objective Opt. | — | Use a hypernetwork to learn the entire Pareto front. | Official
CAGrad | NeurIPS | liu2021conflictaverse | Gradient Correction | Asymptotic Convergence | The search direction is found by solving a subproblem that is similar to MGDA. | Official
SVGD | NeurIPS | liu2021profiling | Multi-Objective Opt. |  |  | Official
COSMOS | ICDM | ruchte2021scalable | — | — |  | Official
HV Maximization | arXiv | deist2021multi | — | — | Utilize hyper-volume to approximate the sample-level Pareto front. | Official
PNG | UAI | ye2022optimization | Multi-Objective Opt. | Convergence rate for convex losses |  | —
RLW & RGW | TMLR | lin2022reasonable | Dynamic Weighting |  |  | Unofficial
Nash-MTL | ICML | navon2022multi | Multi-Objective Opt. | Asymptotic Convergence |  | Official
(X)WC-MGDA | ICML | momma2022multi | Dynamic Weighting | — | Lift the non-negativity requirement on losses in EPO search. | —
Rotograd | ICLR | javaloy2022rotograd |  | — |  | Official
MoCo | ICLR | fernando2023mitigating | Multi-Objective Opt. |  | Stochastic Gradient & Variance Reduction | Official
Recon | ICLR | shi2023recon | Gradient Correction | — |  | Official
Aligned-MTL | CVPR | Senushkin_2023_CVPR | Gradient Correction | — |  | Official
Achievement-based MTL | ICCV | yun2023achievement | Dynamic Weighting | — |  | Official
FULLER | ICCV | huang2023fuller | Dynamic Weighting | — |  | —
2.2.7. Adversarial training
In the era of DL, joint task modeling has shown promising success by employing feature propagation or task balancing. However, it is important to acknowledge that task-specific features do not consistently result in mutual benefits, and learning multiple loosely connected tasks simultaneously introduces irrelevant noise. While task balancing helps alleviate the negative impact of transfer learning, it neglects the information exchange between tasks, often leading to suboptimal solutions. To address this issue, adversarial training (adhikarla2022memory), as an optimization approach, can effectively disentangle the space between task-shared and -specific features by inherently preventing feature interference. This approach involves introducing a task discriminator, which distinguishes features or gradients learned from different tasks. The discriminator is trained along with a shared feature extractor to converge to a saddle point where the discriminator is unable to differentiate features or gradients learned from different tasks. Research in this field can be categorized into two main approaches based on the type of information utilized for adversarial training: representation-based and gradient-based. ASP-MTL (aka AdvMTL) (liu2017adversarial) first proposes an adversarial MTL framework to learn task-shared and -specific features independently and introduces adversarial training to make shared features invariant to the involved tasks. MTAN (liu2018multi) presents an adversarial MTL framework in the image generation tasks, where multiple existing factors for image generation are considered as tasks and disentangled in an adversarial way with the training of shared encoder. RD4MTL (meng2019representation) employs adversarial training to encourage the features from different tasks to be disentangled and the features of irrelevant tasks to be minimally informative. 
GREAT4MTL (sinha2018gradient) and AAMTRL (mao2020adaptive) utilize the gradients derived from different tasks and disentangle the space using gradient reversal procedure (ganin2016domain).
Representation-Based. Adversarial Shared-Private Multi-Task Learning (ASP-MTL, aka AdvMTL) (liu2017adversarial) is the first adversarial MTL framework proposed to alleviate the interference between the shared and task-specific feature spaces of the involved tasks. The underlying observation is that the same word may indicate different sentiments in different tasks, e.g., the word "infantile" in the product review "The infantile cart is simple and easy to use." and in the movie review "This kind of humor is infantile and boring.". The word "infantile" is a potential backdoor word encoded in the shared feature space, as it expresses a neutral attitude in the product review while it conveys a negative attitude in the movie review. ASP-MTL addresses this issue by dividing the feature space into shared and specific (private) spaces in a parallel manner, as shown in Fig. 7(t), and disentangles them using orthogonality constraints and adversarial losses. Let and denote the representations of shared and private layers for the -th task, respectively. The adversarial training process alternates between the shared feature generator (parametrized by ) and the task discriminator (parametrized by ) through a minimax optimization:
(85) |
where denotes the ground-truth label to indicate the task type, and means the use of Cross Entropy loss in practice. To further extract task invariant features from the shared layers, ASP-MTL introduces the orthogonality constraint as follows to disentangle the shared and private feature space.
(86) |
where we abuse the vectorization vec to preserve the sample dimension of the output feature tensors. The final learning objective function consists of three components as below:
(87) |
where is the task-specific objective for the -th task, and and are hyper-parameters to balance the learning terms. This total objective is trained with backpropagation by means of a gradient reversal layer (GRL) (ganin2015unsupervised).
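The orthogonality constraint and the three-term objective can be illustrated numerically. The sketch assumes the squared Frobenius norm of the product between shared and private feature matrices as the penalty, which is the form commonly associated with shared-private models; the function names and the trade-off values `lam` and `gamma` are placeholders chosen for illustration.

```python
import numpy as np

def orthogonality_penalty(shared, private):
    """Squared Frobenius norm of S^T P for (batch, dim) shared and private features."""
    return float(np.sum((shared.T @ private) ** 2))

def total_objective(task_losses, adv_loss, ortho_penalties, lam=0.1, gamma=0.01):
    """Task losses plus weighted adversarial and orthogonality terms."""
    return float(np.sum(task_losses) + lam * adv_loss + gamma * np.sum(ortho_penalties))

overlapping = orthogonality_penalty(
    np.array([[1.0, 0.0], [2.0, 0.0]]), np.array([[0.0, 3.0], [0.0, 1.0]])
)
disjoint = orthogonality_penalty(
    np.array([[1.0, 0.0], [0.0, 0.0]]), np.array([[0.0, 0.0], [0.0, 1.0]])
)
objective = total_objective([1.0, 2.0], adv_loss=0.5, ortho_penalties=[overlapping, disjoint])
```

The penalty vanishes when every shared feature column is orthogonal to every private feature column across the batch, which is what pushes the two spaces to encode different information.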
Multi-Task Adversarial Network (MTAN) (liu2018multi) targets the problem of multiple factors coexisting in image generation. The architecture of MTAN is shown in Fig. 7(u), where the shared encoder extracts features that are disentangled across style factors for use in content classification (discriminator ) and generation (generator ). Let the original image and the corresponding content label be represented by and , respectively. Then the training of the generation task entails updating the shared feature extractor and generator :
(88) |
where are data indices. is sampled from the style label codebook . Eq. (88) means that the generator tries to reconstruct the data itself if and tries to minimize the distance between the style-transferred and any sample with the same content and style labels (i.e. ) otherwise.
The key adversarial training of style labels is defined using Earth Mover’s Distance (EMD) loss (arjovsky2017wasserstein) as follows:
(89) |
where is a gradient penalty term (gulrajani2017improved) for the purpose of training stability and serves as a trade-off hyper-parameter. To add the classification of content factor, the total training objective is formulated as follows:
(90) |
where denotes the Cross-Entropy loss of the content classification task, and both are the hyper-parameters.
Representation Disentanglement for Multi-Task Learning (RD4MTL) (meng2019representation) aims to disentangle the indiscriminate mixing of properties in medical image analysis. As depicted in Fig. 7(v), an adversarial training process encourages the features from different tasks to be disentangled and minimally informative. Let represent the latent features extracted by the specific encoder from the original image , then as the -th task-specific classification loss can be calculated as follows:
(91) |
where is the ground truth label of the -th task, and is the Cross Entropy loss in practice. Furthermore, the adversarial regularization uses a minimax competition process as below:
(92) |
then the total training objective can be formulated as follows:
(93) |
where balances the two loss terms.
Adaptive Adversarial Multi-Task Representation Learning (AAMTRL) (mao2020adaptive) investigates the theoretical mechanism of adversarial MTL using Lagrangian duality and further proposes AAMTRL, which improves the performance of classical adversarial MTL (aka AMTRL methods in (mao2020adaptive)). For simplicity, let the shared and -private features for the -th task be represented by and , aligning with the formalization in Eq. (85). Assuming the shared feature extractor (parametrized by ) and the task discriminator (parametrized by ) to be Bayes-optimal, AAMTRL introduces the matrix to measure the task relatedness, where
(94) |
where is the -th entry of the matrix , and represents the probability that the discriminator classifies the input representations as the -th task type. In AAMTRL, the adaptation is realized by the weighting strategy of task-specific objectives :
(95) |
The classic adversarial MTRL problem can be regarded as the Lagrangian dual function of the following equality-constrained optimization problem:
(96) |
To avoid the sub-optimal solution of the traditional Lagrangian duality in solving the problem above, an augmented Lagrangian with a quadratic form is proposed as follows:
(97) |
where is the Lagrangian multiplier, and is the penalty hyper-parameter that can balance the duality gap. By using Lagrangian duality, AAMTRL can have an exact generalization error bound that is minimally investigated in the classic AMTRL.
Gradient-Based. GRadiEnt Adversarial Training for MTL (GREAT4MTL) (sinha2018gradient) is one of the scenarios of GRadiEnt Adversarial Training (GREAT) that tries to make the gradients indistinguishable across involved tasks. As depicted in Fig. 7(w), the encoder extracts shared features for multiple tasks, and the decoders are used to perform involved tasks. Thus, the basic learning objectives for specific tasks are:
(98) |
where is the total dataset containing tasks, and is dependent on the task type. In GREAT4MTL, the Gradient-Alignment Layer (GAL) is placed after the shared encoder and before the task-specific decoders to perform task discrimination. Unlike representation-based methods that attend to the features, is trained using gradients from different tasks as inputs:
(99) |
where is the ground truth label to indicate the task type, and the Cross-Entropy loss is used to calculate the task classification error. Then the total training objective function is:
(100) |
The GRL is inserted before the GAL to streamline the minimax optimization process above. The trade-off hyper-parameter is eliminated in Eq. (100) by using different learning rates during the training process of and .
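The role of the GRL in this minimax setup can be sketched without any autodiff library. The class below is a hypothetical, framework-free illustration (its name, methods, and the scaling factor `lam` are assumptions): it acts as the identity in the forward pass and flips the sign of the gradient in the backward pass, so that the encoder is updated to confuse the task discriminator while the discriminator itself is trained normally.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features reach the discriminator unchanged.
        return list(features)

    def backward(self, grad_output):
        # The encoder upstream receives the negated (and scaled) gradient.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
out = grl.forward(features)
grad_to_encoder = grl.backward([1.0, -2.0, 4.0])
```

In an actual deep learning framework the same behavior would be implemented as a custom differentiable operation; the sketch only makes the sign flip explicit.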
ASTMT (maninis2019attentive) also employs the GREAT strategy to effectively disentangle the task-shared and task-specific features acquired from the shared backbone and single-tasking components, as illustrated in the right portion of Fig. 7(n). It highlights that GREAT can be seamlessly integrated with other frameworks.
2.2.8. Mixture of Experts (MoE)
Deep neural architectures have been extensively utilized in real-world MTL problems. However, while scaling high-capacity deep neural networks to multi-task settings is conceptually appealing, it remains challenging. The MoE (jacobs1991adaptive) framework inherently incorporates multiple expert networks, each of which can be selected for learning different tasks. The modern MoE layer (eigen2013learning; shazeer2017) has transformed the MoE module into a universally adaptable, plug-and-play component that integrates seamlessly into various systems, including CNNs, RNNs, and Transformers. The MoE layer, as depicted in Fig. 15(a), generally comprises a set of $n$ expert networks $\{f_i\}_{i=1}^{n}$ and a gating network $g$, whose output depends on the input data $\mathbf{x}$. This gating network generates a sparse $n$-dimensional vector that selects the necessary expert networks to compute the final prediction as follows:
$$\hat{\mathbf{y}} = \sum_{i=1}^{n} g(\mathbf{x})_i \, f_i(\mathbf{x}), \tag{101}$$
where $g(\mathbf{x})_i$ is the $i$-th entry of the sparse vector generated by the gating network $g$, and $f_i(\mathbf{x})$ represents the output of the $i$-th expert network. Beyond MoE for STL, Multi-gate Mixture-of-Experts (MMoE) (ma2018modeling) explicitly introduces multiple gates/routers, one ($g_t$) for each task, as shown in Fig. 15(b). The final prediction for the $t$-th task is calculated as
$$\hat{\mathbf{y}}_t = h_t\Big(\sum_{i=1}^{n} g_t(\mathbf{x}_t)_i \, f_i(\mathbf{x}_t)\Big), \tag{102}$$
where $\mathbf{x}_t$ represents the data sampled from the $t$-th task and $h_t$ denotes the task-specific tower network. This prior research has inspired the development and utilization of multi-router MoE for MTL. Follow-up work includes DSelect-k (hazimeh2021dselect), which selects the top-$k$ experts for each task; MT-TaG (gupta2022sparsely), which demonstrates the robustness of multi-router MoE to loosely related tasks; CMoIE (wang2022multi), which constructs more insightful experts instead of incompetent ones; Mod-Squad (chen2023mod), which specializes experts for specific tasks by measuring the mutual information (MI) between tasks and experts; and SummaReranker (ravaut2022summareranker), which performs re-ranking on a set of summary candidates to select the best one. On the other hand, task-conditioned routing with a shared router/gate is another variant, in which task-dependent representations are fed into a single router that makes the expert selections, as depicted in Fig. 15(c) for comparison. The shared-router MoE is discussed separately from the multi-router MoE in M3ViT (fan2022m3vit). Task-level MoE (ye2022eliciting) designs router architectures of varying complexity under shared-router settings, including MLP, LSTM, and Transformer. In both variants, task relationships are captured by the different mixture patterns in which experts are assembled.
Multi-Router MoE. Multi-gate Mixture of Experts (MMoE) (ma2018modeling) replaces the shared layers in the hard parameter sharing architecture with multiple MoE layers and retains an individual router for each task, resembling the soft parameter sharing architecture. The computation of the prediction for the $t$-th task is shown in Eq. (102). Each router network of MMoE is the softmax of a linear transformation of the input data representation:
$$g_t(\mathbf{x}) = \mathrm{softmax}(\mathbf{W}_t \mathbf{x}), \tag{103}$$
where $\mathbf{W}_t \in \mathbb{R}^{n \times d}$, and $n$ and $d$ are the number of experts and the number of features, respectively. In comparison to the soft parameter sharing architecture, MMoE adds only one router per task, resulting in a lighter model and enhanced scalability with an increasing number of tasks. In addition, the conditional computation (bengio2013estimating; shazeer2017) of the MoE layer requires the activation of only specific parts of the experts on a per-example basis. While shazeer2017 offers a top-$k$ gating function by adding tunable Gaussian noise, its discontinuities can lead to convergence issues when learning via gradient-based optimization.
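As a concrete illustration of the multi-router idea, the following minimal sketch mixes the outputs of shared experts with one softmax router per task. It makes simplifying assumptions that are not part of the survey: experts are toy linear maps, the task-specific tower networks are omitted, and all names (`mmoe_mixture`, `expert_weights`, `gate_weights`) are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, x):
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def mmoe_mixture(x, expert_weights, gate_weights):
    """Return one mixed representation per task.

    expert_weights: n matrices (d_out x d), toy linear experts shared by all tasks
    gate_weights:   T matrices (n x d), one linear router per task
    """
    expert_outputs = [matvec(w, x) for w in expert_weights]
    d_out = len(expert_outputs[0])
    mixed = []
    for w_gate in gate_weights:
        gate = softmax(matvec(w_gate, x))  # n mixture weights that sum to 1
        mixed.append([
            sum(g * out[j] for g, out in zip(gate, expert_outputs))
            for j in range(d_out)
        ])
    return mixed


x = [1.0, 2.0]
experts = [[[1.0, 0.0]], [[0.0, 1.0]]]           # expert outputs: [1.0] and [2.0]
routers = [[[0.0, 0.0], [0.0, 0.0]],             # task 0: uniform mixture
           [[10.0, 0.0], [0.0, 0.0]]]            # task 1: strongly prefers expert 0
task_outputs = mmoe_mixture(x, experts, routers)
```

The two tasks read the same expert outputs but combine them differently, which is how task relationships show up as different mixture patterns.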
Differentiable Selection of top-$k$ experts (DSelect-k) (hazimeh2021dselect) bridges this gap by proposing a continuously differentiable and sparse gate in the context of MMoE. A direct cardinality constraint ($\ell_0$ norm) on the output vector of the gate function is not amenable to SGD. To address this issue, a binary encoding scheme is introduced to realize top-$k$ selection via unconstrained minimization. Let $\mathbf{Z} \in \{0, 1\}^{k \times m}$ denote a matrix that selects the top-$k$ experts, whose $i$-th row $\mathbf{z}^{(i)}$ is an $m$-dimensional binary encoding of the index of a single expert, where $m = \log_2 n$ and $n$ is the total number of experts. The gate output vector is defined as follows:
$$q(\boldsymbol{\alpha}, \mathbf{Z}) = \sum_{i=1}^{k} \sigma(\boldsymbol{\alpha})_i \, r\big(\mathbf{z}^{(i)}\big), \tag{104}$$
where $\boldsymbol{\alpha} \in \mathbb{R}^{k}$ is a learnable vector that, after the softmax $\sigma$, controls the importance of the selected top-$k$ experts, and $r(\cdot)$ defines the single expert selector that returns a one-hot encoding of the index of the selected expert. Note that $\|q(\boldsymbol{\alpha}, \mathbf{Z})\|_0 \leq k$ and $\sum_{j=1}^{n} q(\boldsymbol{\alpha}, \mathbf{Z})_j = 1$, which realizes the desired property of the gate output without any constraint involved. Furthermore, DSelect-k relaxes every binary variable in $\mathbf{Z}$ to be continuous and maps it to the range $[0, 1]$ with an element-wise smooth-step function:
$$S(t) = \begin{cases} 0, & t \leq -\gamma/2, \\ -\frac{2}{\gamma^{3}} t^{3} + \frac{3}{2\gamma} t + \frac{1}{2}, & -\gamma/2 \leq t \leq \gamma/2, \\ 1, & t \geq \gamma/2, \end{cases} \tag{105}$$
where $\gamma$ is a hyper-parameter that controls the width of the fractional region. Eqs. (104) and (105) transform the top-$k$ selection to be unconstrained and first-order differentiable.
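The binary encoding and smoothing steps can be made concrete with a short sketch. It follows the cubic smooth-step function and the product-form single expert selector of the original DSelect-k paper as recalled here, and it assumes that the number of experts is a power of two; the function names and the bit ordering are illustrative choices rather than the authors' implementation.

```python
import math


def smooth_step(t, gamma=1.0):
    """Cubic smooth-step: 0 below -gamma/2, 1 above gamma/2, smooth in between."""
    if t <= -gamma / 2:
        return 0.0
    if t >= gamma / 2:
        return 1.0
    return -2.0 / gamma**3 * t**3 + 3.0 / (2.0 * gamma) * t + 0.5


def single_expert_selector(z, num_experts, gamma=1.0):
    """Soft one-hot vector over experts built from m = len(z) real-valued codes.

    Expert i is addressed by the binary code of i. Bit j contributes
    smooth_step(z[j]) when it is 1 and 1 - smooth_step(z[j]) when it is 0.
    """
    if num_experts != 2 ** len(z):
        raise ValueError("this sketch assumes num_experts == 2 ** len(z)")
    bits = [smooth_step(v, gamma) for v in z]
    weights = []
    for i in range(num_experts):
        w = 1.0
        for j, b in enumerate(bits):
            w *= b if (i >> j) & 1 else 1.0 - b
        weights.append(w)
    return weights


def dselect_gate(alpha, z_rows, num_experts, gamma=1.0):
    """Convex combination of k single-expert selectors weighted by softmax(alpha)."""
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    gate = [0.0] * num_experts
    for e, z in zip(exps, z_rows):
        for i, w in enumerate(single_expert_selector(z, num_experts, gamma)):
            gate[i] += (e / total) * w
    return gate


# k = 2 selectors over n = 4 experts; both rows are already "binary" after smoothing.
gate = dselect_gate([0.0, 0.0], [[1.0, -1.0], [-1.0, 1.0]], num_experts=4)
```

Each selector always sums to one, so the gate stays on the probability simplex; once the entries of the code matrix leave the fractional region, the gate has at most k nonzero entries.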
Multi-Task Task-aware Gating (MT-TaG) (gupta2022sparsely) designs a task-aware sparse gating function to route expert selection for each task. The incorporation of task-conditioned information into the routing mechanism is realized by constraining each embedding to only the top- expert selection. Let be the token/embedding representation in the -th position of the input sequence for the -th task. A linear mapping is first applied to obtain the routing logits below:
(106) |
then the routing to the single selected expert is computed through a softmax as follows:
(107) |
where denotes the task-conditioned representation calculated by the selected experts. Noticeably, the task relationship is implicitly encompassed within the variable , thereby remaining independent of the experts involved. SummaReranker (ravaut2022summareranker) targets only the abstractive summarization task but utilizes different metrics to measure it. The re-ranking on a set of summary candidates generated by MMoE can consistently promote the base model.
However, the promise of MMoE has mainly been validated in MTL settings backed by explicit task relationships. Calibrated Mixture of Insightful Experts (CMoIE) (wang2022multi) investigates the negative transfer in MMoE caused by incompetent experts in certain applications. Specifically, a conflict resolution module between each pair of experts and expert communication among the layers of different experts are introduced to promote the diversity and capacity of experts. Additionally, a mixture calibration structure employed in the routing networks encourages the experts to take responsibility for more tasks without losing their specialty. For any input data , the conflict resolution employs the Euclidean distance to measure the outputs from each pair of experts:
(108) |
where is the number of total experts, and denotes the distance matrix between each pair of experts. Based on the max-margin -distribution, the corresponding conflict attention matrix for each pair of experts is calculated to highlight the excessively similar expert pairs:
(109) |
where is the conflict radius of the expert that defines the upper quartile of . Furthermore, the conflict loss is proposed as follows:
(110) |
where is combined with multi-task loss in an end-to-end training process. To capture implicit task relationships by constructing task-aware representations, the fusion matrix is defined using multilinear map as follows:
(111) |
where and are the routing networks and another hidden-layer gating network before the -th layer for the experts. Let the hidden representations at the -th layer of be denoted by , and stack all of them in the form of to obtain the hidden representation matrix . Through the fusion process defined in Eq. (111), the input of the -th layer of is diffused by:
(112) |
where the representation is tailored by the task-specific fusion matrix. The residual block () above prevents the experts from losing their individuality during the fusion process. To further enhance the specialization and concentration of experts on specific tasks, the mixture calibration introduces a dynamic temperature to control the logits for each routing network:
(113) |
where the temperature parameters are progressively decreased from during the training process.
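The effect of a decreasing temperature on a routing distribution can be illustrated as follows. This is a generic temperature-scaled softmax under the assumption that the router logits are divided by the temperature; the exact calibration rule and schedule of CMoIE are not reproduced here, and the function name is hypothetical.

```python
import math


def routing_weights(logits, temperature):
    """Softmax over router logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
early = routing_weights(logits, temperature=5.0)   # high temperature: near-uniform
late = routing_weights(logits, temperature=0.5)    # low temperature: concentrated
```

With a high temperature early in training, all experts receive comparable weight; as the temperature decreases, each router concentrates on the few experts it prefers.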
Mod-Squad (chen2023mod) also allows cooperation and specialization in the process of matching experts and tasks. To make the experts dependent on tasks, the mutual information between them is first measured as below:
(114) |
where the joint probability will be decided by the number of data that are routed inside a task to the target expert. Then the total loss can be formulated as follows:
(115) |
where is the hyper-parameter to control the -th task-specific loss , and balances the multi-task loss term and mutual information term.
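The mutual information between tasks and experts can be estimated directly from routing statistics. The sketch below uses the standard definition of mutual information and assumes, as described above, that the joint probability is given by the normalized number of samples of each task routed to each expert; the function name and the count-matrix layout are illustrative.

```python
import math


def task_expert_mutual_information(counts):
    """counts[t][e]: number of samples of task t that were routed to expert e."""
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("counts must contain at least one routed sample")
    num_experts = len(counts[0])
    p_task = [sum(row) / total for row in counts]
    p_expert = [sum(row[e] for row in counts) / total for e in range(num_experts)]
    mi = 0.0
    for t, row in enumerate(counts):
        for e, c in enumerate(row):
            if c == 0:
                continue  # zero-probability cells contribute nothing
            p_joint = c / total
            mi += p_joint * math.log(p_joint / (p_task[t] * p_expert[e]))
    return mi


shared_experts = task_expert_mutual_information([[5, 5], [5, 5]])
specialized_experts = task_expert_mutual_information([[10, 0], [0, 10]])
```

When every task uses every expert equally, the estimate is zero; when each task is routed to its own expert, it reaches its maximum (log 2 for two equally frequent tasks), which is the kind of specialization the mutual information term rewards.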
Shared-Router (Task-Conditioned) MoE. Task-Level MoE (ye2022eliciting) is the first to use a shared router that takes as input the task representation, which is selected from a look-up embedding table. Moreover, Task-Level MoE is the first to investigate combinations of different router backbones (MLP, LSTM, and Transformer) and softmax variants (softmax, Gumbel-Softmax, and ST Gumbel-Softmax (jang2016categorical)). M3ViT (fan2022m3vit) customizes MoE into a ViT backbone and compares the multi-router MoE with the shared-router MoE. ViT-based MMoE can also be made memory-efficient on hardware, as demonstrated in Edge-MoE (sarkar2023edge).
To circumvent the limitations of a fixed number of activated experts, AdaMV-MoE (chen2023adamv), the Adaptive Mixture of Experts framework for Multi-task Vision Recognition, can automatically determine the number of sparsely activated experts based on input token embeddings. Task-specific router networks are employed to select the most relevant experts for individual tasks. This process can be expressed as:
(116) |
where is the router for -th task. It should be noted that the number of experts () engaged is not predefined. AdaMV-MoE incorporates an adaptive mechanism, specifically the Adaptive Expert Selection (AES) technique, to dynamically adjust this quantity based on task-specific loss values observed during validation on datasets (). If exhibits no signs of decline over several iterations, the number of experts () should be augmented by 1. In contrast, if it exceeds the best loss value above, the number of experts should be reduced. Ultimately, after numerous iterations, the number of experts can be stabilized.
2.2.9. Graph-Based
Graphs have been widely used in data mining and machine learning due to their unique representation of objects and their interactions. Graph neural networks (GNNs) (sperduti1997supervised; gori2005new; scarselli2008graph; wu2020comprehensive), which leverage nodes and the edges connecting them to conduct inference, have gained popularity for their impressive performance in capturing inter-node relations on graphs. It is natural to consider the tasks and corresponding data samples in MTL as nodes and their relations as edges to construct a graph for MTL (alon2017graph). By conducting graph mining on such graphs, relations among tasks or data samples in MTL can be better understood so as to assist the final MTL model in conducting inference (chen2019multi; cao2022relational; liu2020asymmetric; liu2022structured).
MultiKernel (widmer2010leveraging) conducts MTL over a series of classification tasks with predefined hierarchical relations, which is often the case for biological problems. Notably, it constructs a tree that reflects the hierarchical relations between tasks and domains, where leaf nodes are the tasks it studies (e.g., dog), whose parent and ancestors (non-leaf nodes) are the corresponding biological domains (e.g., mammals and animals).
For a queried task , MultiKernel classifies it over every task ’s predictor by
(117) |
where is a pre-calculated constant inversely related to the distance between task and its ancestors . is the representation of task . The representations of nodes within the predefined tree are learned by minimizing the task error. is a learnable variable.
ML-GCN (chen2019multi) is a graph convolutional network (GCN)-based MTL model for capturing the label correlations in multi-label image recognition. Specifically, different from traditional MTL, ML-GCN pre-constructs a correlation matrix that reflects labels’ co-occurrence patterns within datasets. This matrix enables the system to build a label graph, where each node represents a label, and whose feature is the corresponding word embedding.
On retrieving the label graph, ML-GCN jointly trains a CNN and a GCN for MTL. The CNN learns from image datasets to produce image representations, and the GCN learns from the label graph to generate label representations. ML-GCN obtains the multi-label prediction for an input image by computing dot products between the image representation and the label representations as , where and are the CNN model and its parameters, respectively, and is the set of label representations output by the GCN.
ML-GCN resorts to the traditional multi-label classification loss for training. The entire construction of ML-GCN is shown in Fig. 7(x).
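The two ingredients described above, a label correlation matrix built from co-occurrence patterns and a prediction obtained from dot products, can be sketched as follows. The sketch assumes that correlations are conditional probabilities estimated from label counts and binarized with a threshold; the threshold value and all function names are illustrative, and the label representations are passed in directly instead of being produced by a GCN.

```python
def label_correlation(label_sets, num_labels, threshold=0.4):
    """Binary matrix A with A[i][j] = 1 if P(label j | label i) >= threshold."""
    occurrences = [0] * num_labels
    co_occurrences = [[0] * num_labels for _ in range(num_labels)]
    for labels in label_sets:
        present = sorted(set(labels))
        for i in present:
            occurrences[i] += 1
            for j in present:
                if i != j:
                    co_occurrences[i][j] += 1
    corr = [[0] * num_labels for _ in range(num_labels)]
    for i in range(num_labels):
        for j in range(num_labels):
            if occurrences[i] and co_occurrences[i][j] / occurrences[i] >= threshold:
                corr[i][j] = 1
    return corr


def label_scores(image_repr, label_reprs):
    """One score per label: dot product of the image and label representations."""
    return [sum(a * b for a, b in zip(image_repr, w)) for w in label_reprs]


# Four training images annotated with subsets of the labels {0, 1, 2}.
graph = label_correlation([[0, 1], [0, 1], [0], [2]], num_labels=3)
scores = label_scores([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the matrix is built from conditional probabilities, it is generally asymmetric: a rare label can imply a frequent one without the converse holding.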
MetaLink (cao2022relational) assumes that, for a given data point, at the inference time, the multi-task model has access to its labels from auxiliary tasks. Based on this assumption, MetaLink leverages labels from other tasks to improve the predictive performance. Particularly, MetaLink constructs a knowledge graph to capture not only the task-task relations as in ML-GCN but also the inter- and intra-relations between tasks and data.
The knowledge graph consists of two types of nodes: (1) data nodes, whose features are embeddings computed by the neural networks, and (2) task nodes, whose features are the last layer weights of the corresponding task-specific neural networks. Whenever a data sample belongs to a task, an edge is connected between these two nodes, and the label of the edge describes how the data point is classified in the particular task. In this way, MetaLink transfers the traditional MTL to a link prediction task between data nodes and task nodes, as shown in Fig. 7(y).
In terms of updating the entire model, MetaLink does not specify the criterion or introduce any particular regularizing terms.
2.2.10. Neural Architecture Search (NAS)
NAS is a popular method for designing deep neural networks automatically, and it has the potential to change how neural networks are designed and used in many different fields, including MTL. NAS in MTL refers to the use of NAS to design neural networks that can perform multiple tasks simultaneously. This is different from traditional neural network design, where a separate network is typically trained for each task. In MTL, the goal is to learn a shared representation that can be used to perform multiple tasks effectively. Conventional architectures realize multi-tasking either by hard parameter sharing, which trains multiple task heads that share shallow feature extractors, e.g., TCDCN (zhang2014facial) and Fast R-CNN (girshick2014rich; girshick2015fast), or by training a separate network for each task while sharing features across the networks, e.g., Cross-Stitch Networks (misra2016cross) and NDDR-CNN (gao2019nddr). However, the potential design space for deep multi-task neural architectures grows exponentially with the depth, and incorporating more tasks significantly expands the range of candidate solutions.
NAS can be used as an automatic approach to search for the optimal architecture for an MTL system. This involves defining a search space that includes a range of possible architectures and using a search algorithm to explore this space and identify the best-performing architecture. The search algorithm can be based on techniques such as reinforcement learning, evolutionary algorithms, or gradient-based optimization. There are several benefits to using NAS in multi-task learning. For example, it can reduce the need for manual design of the network architecture, improve the performance of the multi-task system, and reduce the amount of data and computation required to train the network. It can also be used to identify architectures that are more efficient and easier to implement in practice.
Fully-Adaptive Feature Sharing (FAFS) (lu2017fully) is the earliest method that trains networks with an adaptive widening process. The initial network is a slimmed-down version obtained by reducing the number of convolutional filters in a CNN or neurons in an MLP. It gradually expands through a multi-round widening and training procedure, facilitated by a top-down splitting algorithm. In practice, the original active layer, depicted as the -th layer in Fig. 7(q), consists of numerous branches. These branches are then grouped together in the lower -th layer. Subsequently, the -th layer becomes the new active layer, and this iterative process continues from the top layers until convergence.
Branched Multi-Task Networks (BMTN) (DBLP:conf/bmvc/VandenhendeGGB20) argues that learning the layer-sharing level, as in early soft parameter sharing methods, suffers from sub-optimal solutions, and that relying solely on NAS to design the MTL architecture is cumbersome. By leveraging the affinities among the involved tasks using Representation Similarity Analysis (RSA) (dwivedi2019representation), BMTN can automatically cluster the tasks at shared locations, so that bottom layers are task-agnostic and top layers gradually become task-specific. For each task, as depicted in Fig. 7(r), BMTN initially computes the representation dissimilarity matrices (RDMs) between images at locations. The RDMs are defined as , where represents the Pearson correlation coefficient (pearson1895vii). Subsequently, the task affinity tensor is established based on the RDMs of all tasks using Spearman’s correlation coefficient (spearman1961proof). Finally, BMTN is established by minimizing the sum of these task dissimilarity scores (i.e., ) between each pair of tasks and at every location , .
Multi-Task Learning by Neural Architecture Search (MTL-NAS) (gao2020mtl) is a method that searches for cross-task edges between fixed single-task network backbones. The framework is shown in Fig. 7(s). It involves a single-shot gradient-based search algorithm that optimizes the architecture weights over all legal connections defined by the search space. Specifically, this search algorithm contains continuous relaxation and discretization procedures. It is able to close the performance gap between search and evaluation and also generalizes popular single-shot gradient-based methods such as DARTS (liu2018darts) and SNAS (xie2018snas).
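The RSA-based affinity computation of BMTN can be sketched at a single location as follows. The sketch assumes that each RDM entry is one minus the Pearson correlation between the features of two images, and that the affinity between two tasks is the Spearman correlation between the upper triangles of their RDMs; all function names are hypothetical, and feature vectors are assumed to be non-constant so that the correlations are well defined.

```python
import math


def pearson(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def rdm(features):
    """features: one feature vector per image; returns 1 - Pearson for each image pair."""
    k = len(features)
    return [[1.0 - pearson(features[i], features[j]) for j in range(k)]
            for i in range(k)]


def upper_triangle(matrix):
    k = len(matrix)
    return [matrix[i][j] for i in range(k) for j in range(i + 1, k)]


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for idx in order[pos:end + 1]:
            result[idx] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    return result


def task_affinity(features_a, features_b):
    """Spearman correlation between the upper triangles of two tasks' RDMs."""
    a = upper_triangle(rdm(features_a))
    b = upper_triangle(rdm(features_b))
    return pearson(ranks(a), ranks(b))


# Features of the same three images extracted by two single-task networks.
task_a = [[1.0, 2.0, 3.0], [1.0, 3.0, 2.0], [3.0, 1.0, 2.0]]
task_b = [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0], [1.0, 3.0, 2.0]]
affinity_same = task_affinity(task_a, task_a)
affinity_diff = task_affinity(task_a, task_b)
```

A task dissimilarity score can then be taken as one minus the affinity, so that tasks whose networks organize the images similarly are encouraged to share a branch.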
2.3. Foundation Model Era: Towards Unified and Versatile
AI models are shifting their focus from deeper networks (e.g., ConvNets (fukushima1980neocognitron; lecun1998gradient; he2016deep; liu2022convnet), GANs (goodfellow2020generative), CapsNets (sabour2017dynamic), RNNs (rumelhart1986learning; hochreiter1997long)) to foundation models (e.g., BERT (devlin2018bert), GPT-4 (https://openai.com/research/gpt-4) (openai2023gpt4), SAM (kirillov2023segment), DALLE 3 (https://openai.com/dall-e-3) (ramesh2021zero)). Such foundation models leverage web-scale pretraining data in the wild (usually in self-supervised, unsupervised, and assisted-manual ways) and then adapt their backbones to different downstream tasks (bommasani2021opportunities; zhou2023comprehensive); they are thus inherently compatible with MTL. In light of the recent development of scalable learners, particularly Transformers, foundation models evolve from parameter-based transfer learning and exhibit new emergent capabilities. They facilitate the integration of multiple tasks into a pretrained backbone, achieved through only fine-tuning or even zero-shot learning (ZSL). In this context, the emergent properties in foundation models extend MTL from a fixed set of tasks (where training and test tasks are identical) to handling unknown tasks. When viewed from a task-oriented perspective, MTL, empowered by foundation models, can be categorized into three distinct types:
(1) (Downstream) Task-Generalizable Fine-tuning. This category involves the uni-modal learning of inclusive representations in semi-supervised, self-supervised, and unsupervised learning manners. Notable examples include BiGAN (donahue2016adversarial; donahue2019large), BERT (devlin2018bert), MoCo (he2020momentum; chen2020improved; chen2021empirical), SimCLR (chen2020simple; chen2020big), MAE (he2022masked), and GPT (radford2018improving; radford2019language; brown2020language; openai2023gpt4). The learned encoders should be transferable to a variety of downstream supervised tasks, thereby enabling them to be multi-task learners.
(2) Task-Promptable Engineering. In this category, the original inputs are modified through task-specific prompts (e.g., SAM (kirillov2023segment)) during the pretraining stage. Prompt engineering can affect the representation of data and facilitate the learners with few-shot and even zero-shot abilities toward new tasks.
(3) Task-Agnostic Unification. This category highlights that the representations remain unbiased toward specific tasks and data modalities via employing a unified serialization/sequence of data tokens, including Pix2Seq (chen2022pixseq; chen2022unified), UniTAB (yang2022unitab), Unified-IO (lu2022unified), Uni-Perceiver (nips_zhu2022uni; cvpr_zhu2022uni; li2023uni), OFA (wang2022ofa; bai2022ofasys), Gato (reed2022generalist), UnIVAL (shukor2023unified), etc. As a result, multi-modal learners can obtain the generalizability from existing tasks to new ones, even those involving diverse data modalities.
2.3.1. Downstream Task Fine-Tuning
At the inception of Pretrained Foundation Models (PFMs) (zhou2023comprehensive), the terminology “pre-training” remained somewhat ambiguous within the field of DL research. This practice involves the initial learning of model backbones on a general dataset, e.g., ImageNet (deng2009imagenet; russakovsky2015imagenet), followed by their transfer to other tasks, where fine-tuning starts from this warm initialization. Consequently, “fine-tuning” before PFMs refers to the fine-tuning of model backbones. In our context, fine-tuning that changes the backbone parameters is referred to as model tuning, unless otherwise specified. This distinction matters since PFMs are costly to backpropagate through, and the ability to generalize a large frozen backbone to multiple downstream tasks, referred to as downstream fine-tuning, can ease this burden. By confining our discussion to downstream fine-tuning with a frozen model, we can extend the previous definition of MTL (refer to Definition 3). In this context, a single model can effectively handle a set of tasks. This approach also facilitates a clear separation from the domain of (parameter-based) TL.
In the context of fine-tuning for downstream tasks facilitated by PFMs, the process typically begins with the pre-training of a backbone foundation model on large data in the wild. This pre-training phase often employs unsupervised or self-supervised methods. Subsequently, the pretrained backbone is fine-tuned using task-specific domain datasets, as illustrated in Fig. 17(a). Leveraging the task-unbiased representations acquired from the frozen backbone, fine-tuning of task-specific heads (e.g., simple MLPs for classification tasks or mask decoders for dense prediction tasks) frequently yields competitive or even superior results when compared to prior supervised outcomes across a spectrum of diverse downstream tasks.
Nonetheless, it is important to note that the pre-training phase tends to restrict data modality due to the constraints of self-supervised techniques, which are inherently data-specific. For instance, methodologies like masked image modeling (MIM) in MAE are suitable for image data, while masked language modeling (MLM) in BERT is tailored for text data. The subsequent review provides an in-depth exploration of downstream task fine-tuning methods categorized by data modality. Specifically, we will discuss these methods within the domains of vision, language, and vision-language tasks.
Vision Tasks. Early pre-training techniques in computer vision primarily focus on learning from pretext tasks. Exemplar CNN (dosovitskiy2014discriminative; alexey2016discriminative), for instance, initially pretrains backbone models by discriminating various patches within unlabeled data. In the case of Inpainting (pathak2016context), the pretext task involves predicting the masked central parts of images. Colorization (zhang2016colorful), on the other hand, establishes mappings from grayscale images to their colored versions. Split-Brain Autoencoders (zhang2017split) force the network to split into two disjoint sub-networks, each processing one-half of the input images while predicting the corresponding missing parts from the other sub-network. Recently, BEiT (bao2021beit; peng2022beit) and MAE (he2022masked) simply reconstruct randomly masked patches of the images to pretrain the backbones, i.e., masked image modeling (MIM). Other MIM methods include iBOT (zhou2021ibot), CAE (chen2023context), SimMIM (xie2022simmim), BEVT (wang2022bevt), ConMIM (yi2022masked), and VideoMAE (tong2022videomae; wang2023videomae), to name a few. Jigsaw (noroozi2016unsupervised) and Completing Damaged Jigsaw Puzzles (CDJP) (kim2018learning) employ Jigsaw puzzles as pretext tasks during model pre-training. Counting (noroozi2017representation) can also serve as a pretext task to facilitate representation learning. Noise As Targets (NAT) (bojanowski2017unsupervised) focuses on learning representations by aligning the deep features of the backbone with predefined targets in a low-dimensional space. RotNet (gidaris2018unsupervised), however, is designed for predicting different image rotations. Notably, such early pretext-task pre-training techniques typically do not require manual annotations, allowing for fast training without the necessity of developing new loss functions. The downstream tasks commonly include classification, object detection, and segmentation.
Parameter-efficient fine-tuning (PEFT) of MTL models is challenging, however, since the model must adapt to the needs of multiple tasks simultaneously. MTLoRA (agiza2024mtlora) is the first to address this problem and outperforms other SOTA PEFT methods.
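An alternative line of research aims to design a general representation learning algorithm that is unbiased to the pretext tasks, often referred to as contrastive self-supervised learning (SSL) (jaiswal2020survey; liu2021self). This method unlocks the potential of representations by introducing a novel loss function that hinges on the concept of “contrast.” If we denote the samples that are similar and dissimilar to $\mathbf{x}$ as $\mathbf{x}^{+}$ and $\mathbf{x}^{-}$, respectively, the Noise Contrastive Estimation (NCE) loss (gutmann2010noise) can be defined as
$$\mathcal{L}_{\mathrm{NCE}} = \mathbb{E}_{\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-}} \left[ -\log \frac{e^{f(\mathbf{x})^{\top} f(\mathbf{x}^{+})}}{e^{f(\mathbf{x})^{\top} f(\mathbf{x}^{+})} + e^{f(\mathbf{x})^{\top} f(\mathbf{x}^{-})}} \right], \tag{118}$$
where the function $f$ represents the encoder used to learn image embeddings. It is worth noting that the similarity measurement above, an inner product that equals the cosine similarity for normalized embeddings, can be customized to suit various scenarios. Additionally, the InfoNCE loss (oord2018representation) extends this concept by incorporating a more extensive set of dissimilar pairs as
$$\mathcal{L}_{\mathrm{InfoNCE}} = \mathbb{E}_{\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-}_{j}} \left[ -\log \frac{e^{f(\mathbf{x})^{\top} f(\mathbf{x}^{+})}}{e^{f(\mathbf{x})^{\top} f(\mathbf{x}^{+})} + \sum_{j=1}^{N-1} e^{f(\mathbf{x})^{\top} f(\mathbf{x}^{-}_{j})}} \right], \tag{119}$$
where $f$ is the encoder and $N$ represents the batch size, comprising $N-1$ negative pairs $(\mathbf{x}, \mathbf{x}^{-}_{j})$ and one positive pair $(\mathbf{x}, \mathbf{x}^{+})$. These loss functions are closely linked to the maximization of mutual information (MI) between the encoded representations.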
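A minimal sketch of the InfoNCE computation for a single anchor is given below. It assumes cosine similarity between already-encoded embeddings and adds a temperature parameter, which is a common practical choice but an assumption here; with a single negative and a temperature of one, it reduces to the two-term NCE form. The function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=1.0):
    """-log( exp(s_pos) / (exp(s_pos) + sum_j exp(s_neg_j)) ) with cosine similarities."""
    a = normalize(anchor)
    s_pos = dot(a, normalize(positive)) / temperature
    s_negs = [dot(a, normalize(n)) / temperature for n in negatives]
    scores = [s_pos] + s_negs
    m = max(scores)  # log-sum-exp trick for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - s_pos


anchor = [1.0, 0.0]
one_negative = info_nce(anchor, [2.0, 0.0], [[0.0, 1.0]])
two_negatives = info_nce(anchor, [2.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

Adding negatives can only enlarge the denominator, so the loss grows unless the anchor stays much closer to its positive than to every negative.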
Many contrastive SSL methods draw from the loss functions (118) and (119) to acquire task-invariant representations. Non-parametric instance discrimination (NPID) (wu2018unsupervised) can capture apparent similarity among instances using NCE. In contrast, contrastive predictive coding (CPC) (oord2018representation; henaff2020data) first introduces the InfoNCE loss for the pre-training of RNN in an autoregressive manner. Deep InfoMax (DIM) (hjelm2018learning), Deep Graph InfoMax (DGI) (velivckovic2018deep), and Augmented Multiscale DIM (AMDIM) (bachman2019learning) take a direct approach by maximizing the MI between representations. Contrastive multiview coding (CMC) (tian2020contrastive) extends the concept of MI maximization to incorporate more than two views, MoCo (he2020momentum; chen2020improved; chen2021empirical) employs InfoNCE but introduces the momentum contrast based on a memory bank used in (wu2018unsupervised). SimCLR (chen2020simple; chen2020big) proposes a novel contrastive loss known as the normalized temperature-scaled cross-entropy loss (NT-Xent) for representation learning. Bootstrap Your Own Latent (BYOL) (grill2020bootstrap), conversely, takes a different approach by obviating the need for negative pairs. On the other hand, several other methods (caron2018deep; caron2020unsupervised; goyal2021self; li2020prototypical) endeavor to employ clustering algorithms that contrast data representations based on class prototypes.
Language Tasks. In the domain of language, initial pre-training approaches utilizing word embeddings (mikolov2013distributed; pennington2014glove) to predict subsequent tokens for a warm start have shown potential in enhancing the performance of downstream NLP tasks (dai2015semi; mccann2017learned). Nonetheless, these methods often rely on a limited dataset for pre-training, which restricts their effectiveness and prevents consistently satisfactory outcomes across the spectrum of downstream NLP tasks. Current Transformer-based Pre-trained Foundation Models (PFMs) in natural language processing can be broadly classified into three types (wang2022pre): encoder-only, decoder-only, and encoder-decoder architectures. Encoder-only architectures employ a bidirectional Transformer encoder designed to reconstruct masked tokens. Decoder-only models utilize a unidirectional Transformer decoder that predicts tokens in a left-to-right autoregressive fashion. Encoder-decoder models are crafted for sequence-to-sequence (seq2seq) generation tasks and are pretrained by masking tokens in the source sequence and predicting them in the target sequence.
This taxonomy aligns with the constraints in terms of tasks. Since the encoder-only architectures, e.g., BERT (devlin2018bert), ERNIE 1.0/2.0 (sun2019ernie; sun2020ernie), SpanBERT (joshi2020spanbert), and DeBERTa (he2020deberta), are pretrained to predict masked tokens based on the bidirectional context, they are better suited for understanding tasks than for generation tasks. They are adept at tasks like document classification, named entity recognition, and question answering, where the full context is available and the task is to understand or extract information rather than generate it. Encoder-only models often have a fixed maximum sequence length, which limits their ability to handle very long documents directly. They are not designed for incremental token-by-token generation and thus are inefficient for tasks that require such predictions, like text completion or interactive text generation. Conversely, decoder-only architectures, e.g., GPT-3 (brown2020language), PanGu-α (zeng2021pangu), Turing-NLG, HyperCLOVA (kim2021changes), Gopher (rae2021scaling), GLaM (du2022glam), LaMDA (thoppilan2022lamda), PaLM (chowdhery2022palm), Open Pre-trained Transformers (OPT) (zhang2022opt), LLaMA (touvron2023llama; touvron2023llama), PanGu-Σ (ren2023pangu) and PaLM-2 (anil2023palm), are pre-trained in a unidirectional context, making them well-suited for generative tasks such as language modeling and text generation. However, this unidirectional training means they may be less effective for tasks that require understanding the full context of the input, as they can only condition on the left context. These models generate one text token at a time, which can be slower than models that handle the entire input at once, and they might struggle with tasks requiring bidirectional context. 
Encoder-decoder architectures, e.g., T5 (raffel2020exploring), BART (lewis-etal-2020-bart), ERNIE 3.0 (sun2021ernie), Switch Transformers (fedus2022switch) and Flan-T5 (chung2022scaling), are more flexible as they can handle both understanding and generation tasks. While they offer considerable advantages in terms of their adaptability to various tasks, they come with trade-offs in terms of model complexity, resource requirements, and potential issues with error propagation.
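The practical difference between bidirectional and unidirectional context discussed above can be sketched with attention masks. This is a minimal illustration under simplifying assumptions: the helper names are hypothetical, and real models apply such masks inside the attention computation rather than to token lists.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def visible_tokens(tokens, mask, i):
    """Tokens that position i can condition on under the given mask."""
    return [tok for tok, allowed in zip(tokens, mask[i]) if allowed]

tokens = ["the", "[MASK]", "sat", "down"]
n = len(tokens)

# An encoder-only model sees both sides of the masked position.
print(visible_tokens(tokens, bidirectional_mask(n), 1))
# A decoder-only model sees only the left context (and the position itself).
print(visible_tokens(tokens, causal_mask(n), 1))
```

An encoder-decoder model combines the two: a bidirectional mask over the source sequence and a causal mask over the target sequence.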
Vision-Language Tasks. PFMs effectively manage multiple tasks without requiring model tuning. However, the aforementioned methods remain constrained to a unimodal context. In real-world scenarios, there is a natural requirement for multimodal or cross-modal intelligence. Such intelligence should handle multiple tasks across diverse modalities and domains. Vision-Language (VL), as its name implies, bridges CV and NLP. It was among the first areas of multi-modal learning to be extensively explored by the research community in recent years. Given the intricacy and scope of VL tasks, foundation models employing vision-language pre-training (VLP) have rapidly gained prominence, showcasing notable performance. Initial VLP approaches (su2019vl; li2019visualbert; tan2019lxmert; chen2020uniter; kim2021vilt; li2021align) centered on specific tasks such as visual question answering (VQA), image captioning, and visual grounding.
The advent of contrastive language-image pre-training (CLIP) (radford2021learning), however, marks a significant leap forward in multiple downstream tasks: it jointly trains dual encoders to align (image, text) pairs within a latent embedding space, showing that SOTA multimodal representations can be learned from unstructured image-text data. The general representations obtained by cross-modal contrastive learning deliver strong zero-shot transfer performance across various vision-language (VL) tasks. In a similar trajectory, A Large-scale ImaGe and Noisy-text embedding (ALIGN) (jia2021scaling) leverages uncurated data, amplifying the efficacy of VLP in downstream cross-modal retrieval tasks. Other contrastive VLP methods include ALBEF (li2021align), WenLan (huo2021wenlan), triple contrastive learning (TCL) (yang2022vision), and BLIP (li2022blip; li2023blip). All these methods contribute to the learning of general-purpose visual and linguistic representations that adapt to a variety of downstream tasks, ranging from cross-modal reasoning (e.g., VQA) and cross-modal matching (e.g., Image Text Retrieval and Visual Referring Expression) to vision and language generation tasks. Notably, DALL-E (ramesh2021zero) stands out for its capability to perform text-to-image generation in a zero-shot manner, meeting commercial application standards. This underscores the potential and versatility of VLP in facilitating generalist applications.
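To make the zero-shot transfer enabled by aligned dual encoders more concrete, the sketch below classifies an image embedding by comparing it with text embeddings of class descriptions. It assumes the embeddings have already been produced by some image and text encoders; the function names, the temperature value, and the toy vectors are illustrative assumptions rather than the API of any released model.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Pick the class whose text embedding is most similar to the image embedding.

    `temperature` is an assumed scaling factor applied before the softmax.
    """
    img = l2_normalize(image_emb)
    logits = {}
    for label, emb in class_text_embs.items():
        txt = l2_normalize(emb)
        logits[label] = sum(a * b for a, b in zip(img, txt)) / temperature
    top = max(logits.values())                      # subtract max for stability
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    probs = {k: v / total for k, v in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical embeddings standing in for encoder outputs.
image_emb = [0.9, 0.1, 0.0]
class_text_embs = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
label, probs = zero_shot_classify(image_emb, class_text_embs)
print(label)
```

No classifier head is trained here: new classes are added simply by writing new text descriptions, which is what makes the setting zero-shot.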
2.3.2. Task Prompting
As PFMs evolve, the use of prompting to adapt frozen PFMs to downstream tasks first became widely recognized under the name “prompt design” (brown2020language) and was subsequently carried forward through the practice of “prompt tuning” (lester2021power). Conceptually, prompts serve as carriers of task-descriptive information, enabling the adaptation of PFMs to various tasks; they can be either manually crafted or automatically generated, as illustrated in Fig. 17(b). The primary benefit of prompts is that they substantially reduce the need for task-specific fine-tuning: the backbone parameters of the PFM are frozen and only task-indicating prompts are learned, which enhances few-shot or even zero-shot generalizability while requiring only augmented inputs and minimal to no parameter updates. A comprehensive examination of prompt taxonomy exceeds the scope of this section. Consequently, we adopt the notion of task prompting to encompass all prompt engineering methodologies within the framework of task adaptation and generalization.
The additional task-specific prompts that augment the model can be hard or soft (gu2023systematic). Hard prompts contain task instructions or hints in human-interpretable natural language, including human instructions (radford2019language; efrat2020turking) in the early stage and the more advanced In-Context Learning (ICL) (dong2022survey) and chain-of-thought (CoT) (yu2023towards; chu2023survey). Soft prompts, also referred to as continuous prompting or prompt tuning, are optimized implicitly in the embedding space and can be learned or propagated to align with specific tasks.
Hard Prompt Engineering. By making predictions based on a few examples in the context, i.e., ICL, Large Language Models (LLMs) can perform different tasks. This ability to learn from demonstration and analogy is also regarded as an emergent ability (wei2022emergent) of LLMs. GPT-3 (brown2020language) first verified that LLMs are few-shot learners and that different tasks can be performed given a few examples in the form of demonstration context. InstructGPT (ouyang2022training) further aligned LLMs with user intent using reinforcement learning from human feedback (RLHF). The developments in ICL include strategies in both the training stage (wei2021finetuned; chen-etal-2022-improving; min-etal-2022-metaicl; wang2022super; iyer2022opt; wei2023symbol; gu2023pre) and the inference stage (liu2021makes; rubin-etal-2022-learning; gonen2022demystifying; sorensen-etal-2022-information; zhang2022active; li2023finding; lu-etal-2022-fantastically; honovich2022instruction; zhou2022least; hao2022structured; xu2023small; xu2023k). FLAN (wei2021finetuned) tuned LLMs via natural language instruction templates over 60 NLP tasks and surpassed zero-shot GPT-3 on some of the datasets. MetaICL (min-etal-2022-metaicl) introduced meta-training for ICL on a broader spectrum (100-level) of NLP tasks. Sup-NatInst (wang2022super) presented a benchmark of 1000-level NLP tasks and proposed Tk-Instruct, which can outperform InstructGPT with fewer parameters. OPT-IML (iyer2022opt) scales LLM instruction meta-learning to 2000 NLP tasks through the lens of generalization. Symbol Tuning (wei2023symbol) targets the situation in which instructions or natural language labels are of little use in identifying the task. PICL (gu2023pre) enhanced the ICL ability of LLMs through pre-training while maintaining task generalization, whereas earlier investigations focused on how to select in-context examples for better few-shot capabilities during the testing stage (liu2021makes). 
Other methods (gonen2022demystifying; sorensen-etal-2022-information; zhang2022active; li2023finding; lu-etal-2022-fantastically; honovich2022instruction; zhou2022least; hao2022structured; xu2023small; xu2023k) sought to understand, from different angles, why performance varies across prompts and how to pick better prompts. After the prompt retriever (rubin-etal-2022-learning) proved effective for ICL, many efforts used a prompt pool to support retrieval-based prompting, where relevant prompts or contexts are retrieved for ICL (rubin2021learning; li2023unified; ye2023compositional; zhang2023makes).
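The idea of retrieval-based prompting for ICL can be illustrated with a small sketch: demonstrations are retrieved from a pool by similarity to the query and concatenated into a few-shot prompt. The word-overlap (Jaccard) retriever, the prompt template, and all names below are simplifying assumptions for illustration; the cited methods train dense retrievers and use their own templates.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (a stand-in for a learned retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def retrieve(query, pool, k=2):
    """Return the k (input, output) demonstrations most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return ranked[:k]

def build_prompt(instruction, demos, query):
    """Assemble instruction, retrieved demonstrations, and the unanswered query."""
    parts = [instruction]
    for x, y in demos:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Hypothetical prompt pool mixing examples from different tasks.
pool = [
    ("the movie was great", "positive"),
    ("the food was terrible", "negative"),
    ("translate cat to French", "chat"),
    ("the movie was boring", "negative"),
]
query = "the movie was wonderful"
demos = retrieve(query, pool, k=2)
prompt = build_prompt("Classify the sentiment.", demos, query)
print(prompt)
```

The resulting string would be passed to a frozen LLM; no parameters are updated, and switching tasks only changes the instruction and the retrieved demonstrations.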
Furthermore, chain-of-thought (CoT) prompts are a series of instructions in progressive order that help LLMs perform complex reasoning tasks step by step (wei2022chain; kojima2022large; zhang2022automatic; fu2022complexity; ho2022large; trivedi2022interleaving; chen2022program). Manual-CoT (wei2022chain) first explores how to improve the reasoning ability of LLMs by generating CoT. Zero-Shot-CoT (kojima2022large) proposes a single task-agnostic zero-shot prompt to surpass ICL even without input-output demonstrations. Complex-CoT (fu2022complexity) shows that complex reasoning chains outperform simple chains. Auto-CoT (zhang2022automatic) mitigates the mistakes that could occur in previous manual designs by automatically constructing demonstrations for different questions. Fine-tune-CoT (ho2022large) uses teacher-generated reasoning to fine-tune smaller models. IRCoT (trivedi2022interleaving) interleaves retrieval with CoT steps and, in turn, improves CoT with the retrieved results. PoT (chen2022program) uses programming language statements to delegate math computations.
Soft Prompt Tuning. In comparison, soft prompt tuning optimizes continuous prompt vectors through backpropagation and gradient descent. lester2021power introduces the concept of “prompt tuning” and distinguishes it from previous model tuning and prompt design methods. During training, prompt tuning refines the prompts to improve learning performance on specific tasks. Thus, the multi-task setting can be realized by simply mixing training data across different tasks. Soft Prompt Transfer (SPoT) (vu2021spot) pioneers the demonstration that prompt tuning can efficiently transfer from source to target tasks, offering a parameter-efficient approach to prompt-based transfer learning across diverse tasks. P-Tuning (liu2022p) empirically optimizes prompt tuning to be universally effective across a wide range of tasks. ATTEntional Mixtures of Prompt Tuning (ATTEMPT) (asai2022attempt) exemplifies this concept by combining multiple prompts trained on large-scale source tasks, generalizing instance-wise prompts on target tasks while keeping model parameters and source prompts frozen. Multi-task Pre-trained Modular Prompt (MP2) (sun-etal-2023-multitask) enhances FSL for prompt tuning in multi-task settings. 10.1145/3583780.3614913 is the first to showcase that prompt learning achieves SOTA performance for MTL in FSL settings, even surpassing ChatGPT. Hierarchical Prompt (HiPro) learning (liu2023hierarchical) evaluates prompt tuning on standard MTL datasets and outperforms SOTA MTL methodologies by learning task-shared and task-individual prompts. Multitask Vision-Language Prompt Tuning (MVLPT) (shen2024multitask) incorporates cross-task knowledge into learning a single transferable prompt for vision-language models (VLMs). Prompt Guided Transformer (PGT) (lu2024prompt) introduces a prompt-conditioned Transformer block, integrating task-specific prompts into the self-attention mechanism, achieving global dependency modeling and parameter-efficient feature adaptation across multiple tasks. 
The PromptonomyViT (PViT) model (herzig2024promptonomyvit) leverages prompts to capture task-specific information in video Transformers.
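The mechanics of soft prompt tuning in a multi-task setting can be sketched on a toy problem: a frozen backbone is shared across tasks, and only one small prompt per task is updated by gradient descent. The backbone below (mean pooling followed by a fixed linear readout) is a deliberately simple stand-in for a PFM, and every name and value is a hypothetical choice for this example.

```python
def forward(prompt, token_embs, w):
    """Frozen toy backbone: mean-pool [prompt; tokens], then apply a linear readout w."""
    seq = prompt + token_embs
    pooled = [sum(v[d] for v in seq) / len(seq) for d in range(len(w))]
    return sum(wd * pd for wd, pd in zip(w, pooled))

def prompt_tuning_step(prompt, token_embs, w, target, lr=0.5):
    """One gradient step on squared error; only the prompt vectors are updated."""
    seq_len = len(prompt) + len(token_embs)
    err = forward(prompt, token_embs, w) - target
    # For this backbone, d(loss)/d(prompt[j][d]) = 2 * err * w[d] / seq_len.
    return [[p[d] - lr * 2.0 * err * w[d] / seq_len for d in range(len(w))]
            for p in prompt]

W_FROZEN = [1.0, -1.0]                 # backbone parameters, never updated
TOKENS = [[0.5, 0.1], [0.2, 0.4]]      # frozen input embeddings shared by both tasks
targets = {"task_a": 1.0, "task_b": -1.0}
prompts = {task: [[0.0, 0.0]] for task in targets}   # one soft prompt vector per task

# Multi-task training by interleaving steps from both tasks.
for _ in range(50):
    for task, y in targets.items():
        prompts[task] = prompt_tuning_step(prompts[task], TOKENS, W_FROZEN, y)

for task, y in targets.items():
    print(task, round(forward(prompts[task], TOKENS, W_FROZEN), 3), "target", y)
```

The same frozen backbone and the same input produce different outputs depending on which task prompt is prepended, while the number of trainable values per task is only the size of the prompt.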
Prefix-tuning (li2021prefix) is another lightweight alternative to fine-tuning LLMs for different tasks that also keeps model parameters frozen. Prefix-tuning learns a continuous task-specific vector prefixed to the subsequent tokens. It obtains comparable performance in the full-data setting and outperforms fine-tuning in low-data settings. chen2022unisumm proposes a Unified few-shot Summarization (UniSumm) model pretrained on multiple text summarization tasks, which can generalize to different few-shot tasks through prefix-tuning. chong2023leveraging trains a prefix transfer module to selectively leverage the knowledge from various prefixes according to the input text. Collaborative domain-Prefix tuning for cross-domain NER (CP-NER) (chen2023one) utilizes text-to-text generation, grounding domain-related instructions to transfer knowledge to new domain tasks. Prefix-tuning approaches highlight the importance of leveraging prefixes and domain-specific information for improving performance in multiple tasks.
2.3.3. Unified Generalist Models
The ambitious aspiration, shared by both research communities and industries, has always been to transition from specialization to unification, thereby constructing an ideal generalist model capable of addressing a diverse set of tasks with varying modalities. The advent of large language models (LLMs)
The blueprint for designing general-purpose multimodal foundation models aligns with recent unified models such as Gato (reed2022generalist), Unified-IO (lu2022unified), OFA (wang2022ofa), and Uni-Perceiver (zhu2022uni; li2023uni). These methods can perform a variety of tasks spanning from CV to NLP, without modality limitations. Please see Fig. 19 for an illustration.
To pretrain a Transformer backbone for general MTL usage, we need to tokenize the multi-modal input data. For images, the common practice follows ViT (dosovitskiy2020image) by sequencing non-overlapping, fixed-size patches in raster order. Typically, the bounding boxes of objects in region-based tasks are represented by the quantization scheme of Pix2Seq (chen2022pixseq). For text preprocessing, the OFA framework adopts the same BPE tokenizer (sennrich2015neural) used in BART (lewis-etal-2020-bart), and its tokens keep the order of the raw input text. Based on this preprocessing, it is possible to build a unified vocabulary for all visual, linguistic, and multi-modal tokens. After that, suppose we are given a sequence of tokens $x_{i,b}$ as input, where $i = 1, \dots, I$ indexes the tokens in a data sample and $b = 1, \dots, B$ indexes a sample in a training batch. The architecture for a unified model is parametrized by $\theta$. Then we are able to autoregressively train the model via the chain rule as follows:
$$\mathcal{L}(\theta) = -\sum_{b=1}^{B} \log \prod_{i=1}^{I} p_{\theta}\left(x_{i,b} \mid x_{1,b}, \dots, x_{i-1,b}\right) = -\sum_{b=1}^{B} \sum_{i=1}^{I} \log p_{\theta}\left(x_{i,b} \mid x_{<i,b}\right) \qquad (120)$$
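The two ingredients described above, raster-order patch tokenization and an autoregressive objective factorized by the chain rule, can be sketched as follows. This is a minimal illustration: `toy_cond_prob` is a hypothetical stand-in for the conditional distribution of a trained model, and the helper names and the tiny 4x4 "image" are assumptions made for the example.

```python
import math

def patchify(image, patch):
    """Split an H x W grid into non-overlapping patch x patch blocks in raster order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):            # rows of patches, top to bottom
        for left in range(0, w, patch):       # patches within a row, left to right
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def sequence_nll(tokens, cond_prob):
    """Chain rule: -log p(x_1..x_I) = -sum_i log p(x_i | x_<i)."""
    return -sum(math.log(cond_prob(tok, tokens[:i])) for i, tok in enumerate(tokens))

def batch_nll(batch, cond_prob):
    """Sum the per-sequence negative log-likelihoods over a training batch."""
    return sum(sequence_nll(seq, cond_prob) for seq in batch)

def toy_cond_prob(token, prefix, vocab_size=4):
    """Hypothetical model: uniform first token, then favors repeating the last token."""
    if not prefix:
        return 1.0 / vocab_size
    return 0.7 if token == prefix[-1] else 0.3 / (vocab_size - 1)

image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(patchify(image, 2))
print(batch_nll([[1, 1, 2], [0]], toy_cond_prob))
```

In a unified model, the patch blocks and text pieces would each be mapped to token ids in a shared vocabulary before the loss is applied to the concatenated sequence.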
The concept of a unified architecture for multi-modal MTL can be traced back to OmniNet (pramanik2019omninet). Drawing on the potential of Transformers, pramanik2019omninet propose a single model that supports tasks with multiple input modalities as well as asynchronous MTL. lu202012 investigates the relationships between vision-language (VL) tasks and proposes a single model targeting 12 datasets simultaneously. li2021towards introduces the concept of unified foundation models by jointly pre-training Transformers on unpaired images and text data. The Unified Transformer (UniT) model (hu2021unit) is a realization of this concept. It features separate encoders for different input modalities and a shared decoder over the encoded input representations. Each task is associated with specific heads in the shared decoder. One-for-All (OFA) (wang2022ofa; bai2022ofasys) is proposed as a task-agnostic and modality-agnostic framework. OFA aims to unify task-specific layers for downstream tasks, providing a versatile solution. However, it is important to note that OFA currently lacks support for video data and necessitates fine-tuning for downstream tasks. Uni-Perceiver (zhu2022uni) is a unified architecture for generic perception for zero-shot and few-shot tasks, which includes a video tokenizer with temporal positional embeddings. Uni-Perceiver v2 (li2023uni) further introduces task-balanced gradient normalization to ensure stable MTL, which enables larger batch-size training for various tasks. More importantly, unlike OFA (wang2022ofa), Uni-Perceiver v2 requires no task-specific adaptation. Mask DETR with Improved deNoising anchOr boxes (Mask DINO) (li2023mask) is a unified framework designed for object detection and segmentation. Mask DINO uses an additional mask prediction branch to unify the query selection for masks. 
All-in-one Transformer (wang2023all) unifies video and text encoders by introducing a token rolling operation to encode temporal representations from videos. Omnivorous Masked Auto-Encoder (OmniMAE) (girdhar2023omnimae) shows that MAE can be used to pretrain a ViT on images and videos without any human labels. OmniVec (srivastava2024omnivec) also pretrains a unified architecture on self-supervised masked data, including visual, audio, text, and 3D data, which realizes cross-modal task generalization.
3. Miscellaneous
3.1. Fairness and Bias in MTL
While most of the existing research on bias and fairness implications primarily focuses on STL (mehrabi2021survey), wang2021understanding pioneer the exploration of the fairness-accuracy trade-off within the MTL setting. The challenge of unaligned fairness goals arises in MTL models that optimize accuracy for all tasks. The introduction of novel multi-task fairness metrics, such as average relative fairness gap and average relative error, aids in quantifying this trade-off in MTL applications. li2023fairness emphasize that misspecification of majority and minority groups in the involved tasks disproportionately affects minority tasks, and they propose over-parameterization as a viable solution to achieve fairness by covering all tasks. hu2023fairness extend the definition of Strong Demographic Parity (agarwal2019fair; jiang2020wasserstein) to MTL using multi-marginal Wasserstein barycenters (chzhen2020fair), providing an optimal fair multi-task solution to the fairness-accuracy trade-off. Additionally, roy2022learning further demonstrates that improving fairness can positively impact accuracy. Learning to Teach Fair Multi-Tasking (L2T-FMT) (roy2022learning) introduces a teacher-student network to address fair MTL problems. In this framework, the teacher guides the student in selecting fairness or accuracy objectives during training, offering a dynamic approach to balancing these objectives. Drawing an analogy, roy2023fairbranch liken the negative impact of task-specific fairness to negative transfer and introduce FairBranch, a method that groups related tasks to mitigate this negative transfer through fairness loss gradient conflict correction. In recent years, prioritizing fair MTL to mitigate biases arising from negative transfer has emerged as a promising direction. This approach can ensure that models treat all tasks fairly, avoiding disproportionate impacts on specific groups or tasks. 
By preventing biased outcomes, fair MTL contributes to averting potential societal harm.
3.2. Security and Privacy in MTL
Attack and Defense. MTL is an impactful technique employed to bolster attacks in diverse sectors. It notably expedites the creation of adversarial examples for numerous tasks simultaneously through the exploitation of task-shared knowledge (guo2020multi). In the field of automatic speaker verification, multi-task learning strategies have been utilized to identify replay attack spoofing and to classify different types of replay noise (shim2018replay). With regard to reinforcement learning, the vulnerability of multi-task federated reinforcement learning algorithms to adversarial attacks has been examined, resulting in the development of an adaptable attack method and a refined federated reinforcement learning algorithm (anwar2021multi). Additionally, within the realm of deep reinforcement learning, a multi-objective strategy for developing attack policies has been suggested, considering both the performance degradation and the cost related to the attack (garcia2020learning). Conversely, MTL can also serve as a means to heighten the model’s resilience, leading to an improved defense against a wide array of malicious attacks. For instance, the robustness of models to adversarial attacks on individual tasks has been shown to increase when models are trained on multiple tasks concurrently (mao2020multitask; guo2020multi). Likewise, multi-task learning has been employed for adversarial defense (naseer2022stylized), using supplementary data from the feature space to design more formidable adversaries and boost the model’s resilience. Through the utilization of multi-task objectives, such as cross-entropy loss, feature-scattering, and margin losses, more powerful perturbations can be devised for adversarial training. This technique has been used in several domains, such as computer vision and speech recognition, and has demonstrated enhanced adversarial accuracy and resilience (pal2021adversarial; chan2021multiple).
Privacy-preserving. Privacy-preserving multi-task learning (PP-MTL) (liu2018privacy) aims to ensure the confidentiality of sensitive data and boost learning outcomes by facilitating knowledge transfer across related tasks. PP-MTL algorithms employ cryptographic mechanisms to safeguard data residing across various locations or nodes, using these to relay cumulative data - for instance, gradients or supports - to a centralized server where the aggregated data is processed to create the desired models. Earlier strategies could not deliver a demonstrable or verifiable security assurance for the transferred cumulative data. To tackle this shortcoming, PP-MTL protocols have been proposed that leverage advanced cryptographic methods to provide stronger security assurance (liu2018privacy). Furthermore, differentially private stochastic gradient descent algorithms have been employed to optimize the comprehensive multi-task model and safeguard the privacy of training data by adding appropriately calibrated noise to the gradients of the loss functions (zhang2020privacy). To maintain the privacy of distributed data, privacy-preserving distributed MTL frameworks have been introduced, incorporating a privacy-preserving proximal gradient algorithm. This algorithm updates models asynchronously and offers guaranteed differential privacy (xie2017privacy).
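The idea of adding calibrated noise to gradients can be sketched with a generic differentially private SGD step: each per-example gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. This is a textbook-style sketch, not the specific algorithm of the cited works; the function names and hyperparameter values are assumptions, and no privacy accounting is performed.

```python
import math
import random

def clip(grad, max_norm):
    """Rescale a gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr=0.1, max_norm=1.0,
                noise_multiplier=1.0, rng=random):
    """One noisy update: clip each example's gradient, sum, add noise, average."""
    n = len(per_example_grads)
    clipped = [clip(g, max_norm) for g in per_example_grads]
    noisy_mean = []
    for d in range(len(params)):
        total = sum(g[d] for g in clipped)
        # Gaussian noise with standard deviation proportional to the clipping norm.
        total += rng.gauss(0.0, noise_multiplier * max_norm)
        noisy_mean.append(total / n)
    return [p - lr * g for p, g in zip(params, noisy_mean)]

# Hypothetical per-example gradients of a shared multi-task parameter vector.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)          # seeded only to make the demo repeatable
print(dp_sgd_step([0.0, 0.0], grads, rng=rng))
```

Clipping bounds the influence of any single example, which is what allows the noise scale to be calibrated independently of the data.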
Federated Learning. Federated Multi-task Learning (FMTL) (smith2017federated) represents a platform for training machine learning models over distributed device networks. By personalizing models for individual clients, it successfully navigates the statistical complexities posed by federated learning, given the heterogeneity of local data distributions (smith2017federated). It effectively manages high communication overhead, lags, and reliability in distributed multi-task learning (marfoq2021federated). The efficacy of FMTL has been demonstrated on real-world federated datasets, even with non-convex models (sarcheshmehpour2021networked). It can be utilized in both a central server-client and a fully decentralized structure and provides the capacity to serve personalized models to clients unseen during training (corinzia2019variational). Furthermore, the over-the-air computation can be integrated within FMTL to enhance system efficiency, reducing channel usage without a substantial drop in learning performance (ma2022over).
3.3. Distribution Shifts in MTL
While Multi-Task Learning (MTL) excels at leveraging shared information to boost individual task performance (1.3), its real-world applicability often hinges on its ability to adapt to unforeseen data distributions. Distribution shifts, where the data encountered during deployment deviates from the training distribution, are omnipresent challenges that can significantly degrade MTL performance, especially on new tasks or domains. Recognizing and mitigating these shifts is crucial not just for maintaining the generalizability and resilience of MTL models but also for unlocking their full potential in real-world applications.
Recent research offers a diverse arsenal of approaches to tackle distribution shifts in MTL. Vision Transformer Adapters (ViTA) (bhattacharjee2023vision) introduce dedicated modules within the model architecture that enhance adaptability to diverse tasks and data distributions. Techniques like regularizing spurious correlations (hu2022improving) target misleading associations between tasks, reducing their influence on the overall model performance. Scalarization methods provide a scalable framework for handling the complexities of multi-task and multi-domain learning while facing distribution shifts (royer2023scalarization). Multi-objective learning strategies, exemplified by approaches addressing catastrophic forgetting in time-series applications (10.1145/3502728), strive to mitigate the issue of forgetting previously learned skills when encountering new data. Finally, techniques like reward modeling (faal2023reward) demonstrate their versatility in addressing distribution shifts, as seen in mitigating toxicity issues in transformer-based language models. This array of advancements underscores the ongoing efforts to equip MTL models with enhanced adaptability and resilience to varying task distributions, ultimately paving the way for their reliable and widespread real-world application.
Looking ahead, the evolving landscape of MTL research envisions models that not only react to distribution shifts but proactively anticipate and address them. As highlighted in a recent comprehensive study (adhikarla2023robust), understanding and mitigating distribution shifts are becoming paramount for MTL’s success. The ability to navigate diverse and dynamic data distributions is crucial for the broader deployment of MTL in complex, real-world scenarios. By advancing techniques that enhance adaptability and robustness, researchers are striving to empower MTL models to excel in the face of evolving task and domain landscapes, unlocking their potential to revolutionize a wide array of applications.
3.4. Non-supervised MTL
Semi-supervised learning. Supervised learning has been a fundamental technique in machine learning in recent years. However, it requires a substantial amount of labeled data to yield promising results, and collecting such labels is both time-consuming and costly. To mitigate this, semi-supervised learning has been introduced, leveraging the diverse array of unlabeled datasets to reduce the dependence on labeled data. Existing semi-supervised algorithms are often not amenable to MTL. To bridge this gap, liu2007semi introduces a semi-supervised MTL framework featuring parameterized classifiers. Each classifier is associated with a partially labeled data manifold and is jointly learned under a soft-sharing prior that influences their parameters. This approach effectively utilizes unlabeled data by basing the learning of classifiers on neighborhood structures. Besides, augenstein2018multi presents a method that models the relationship between labels by inducing a joint label embedding space for MTL and proposes a Label Transfer Network, which learns to transfer labels between tasks and uses semi-supervised learning to leverage them for training. In real-world applications, multi-task regression is a prevalent challenge. zhang2009semi proposes the SMTR method, which is grounded in Gaussian Processes (GP). This method operates under the assumption that the kernel parameters for all tasks share a common prior. To enhance SMTR, the approach incorporates unlabeled data by modifying the GP prior’s kernel function into a data-dependent one. This modification leads to a semi-supervised extension of the original SMTR method, named SSMTR. Additionally, chen2020multi introduces a multi-task mean teacher model for semi-supervised shadow detection, effectively utilizing unlabeled data and simultaneously learning multiple aspects of shadows. 
Specifically, they construct a multi-task baseline model designed to detect shadow regions, shadow edges, and shadow count, leveraging the complementary information of these elements. This baseline model is then used in both the student and teacher networks. The approach further aligns the predictions of the three tasks across these networks and uses this alignment to compute a consistency loss on unlabeled data. This loss is combined with the supervised loss on labeled data, computed from the predictions of the multi-task baseline model, thereby enhancing the model’s learning effectiveness. nguyen2019multi proposed a network employing an MTL approach to detect manipulated images and videos and to identify the manipulated regions within each query. To enhance the network’s generalizability, a semi-supervised learning approach is integrated, in which the architecture comprises an encoder and a Y-shaped decoder. The activation of the encoded features is used for binary classification. Meanwhile, the outputs of the decoder’s two branches serve distinct purposes: one segments the manipulated regions and the other reconstructs the input. This dual functionality contributes to the overall performance of the network. Semi-supervised MTL has emerged as a popular field, and the studies above propose different mechanisms for integrating semi-supervised concepts and demonstrate their effectiveness experimentally. Despite these advancements, there remains substantial scope for further research in this subfield, and continued exploration of semi-supervised MTL promises to yield further insights.
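The multi-task mean teacher scheme described above relies on two generic building blocks: a teacher whose weights are an exponential moving average (EMA) of the student's weights, and a consistency loss that aligns student and teacher predictions for every task on unlabeled data. The sketch below shows these pieces in isolation; the task names, the decay and weighting values, and the use of mean squared error are assumptions for illustration and may differ from the cited method.

```python
def ema_update(teacher, student, decay=0.99):
    """Teacher weights track an exponential moving average of student weights."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def mse(a, b):
    """Mean squared error between two equal-length prediction lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def consistency_loss(student_preds, teacher_preds):
    """Sum over tasks of the student-teacher disagreement on unlabeled data."""
    return sum(mse(student_preds[t], teacher_preds[t]) for t in student_preds)

def total_loss(supervised, consistency, weight=1.0):
    """Supervised loss on labeled data plus weighted consistency on unlabeled data."""
    return supervised + weight * consistency

# Hypothetical predictions for three tasks on one unlabeled sample.
student_preds = {"region": [0.8, 0.2], "edge": [0.1, 0.9], "count": [2.0]}
teacher_preds = {"region": [0.6, 0.2], "edge": [0.1, 0.7], "count": [2.0]}

print(consistency_loss(student_preds, teacher_preds))
print(ema_update([1.0, 2.0], [0.0, 0.0], decay=0.5))
```

Only the student is trained by gradient descent; the teacher is updated with `ema_update` after each student step and provides the targets for the consistency term.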
unsupervised learning. Moving beyond the realm of semi-supervised learning, the real-world often presents scenarios where obtaining labeled data of all tasks in MTL learning is not feasible, underscoring the significance of unsupervised learning in the field of multitask learning (MTL). OpenAI, in their groundbreaking study by (radford2019language), introduced the widely acclaimed GPT model, demonstrating a significant advancement in multitask learning (MTL) within the field of natural language processing. Their research showed that language models begin to autonomously learn a variety of MTL tasks - including question answering, machine translation, reading comprehension, and summarization - without the need for explicit supervision. This capability was notably observed when the GPT model was trained on , a vast new dataset comprising millions of webpages. This development highlights a major stride in the field, showcasing the potential of large language models to adapt to a wide array of tasks through extensive unsupervised learning. Besides, to alleviate the limitation of existing clustering approaches that neglect the underlying relationship and treat these clustering tasks either individually or simply together, the study by (5360241) introduces an innovative clustering approach called , which conducts several related clustering tasks concurrently and leverages the relationships between these tasks to improve clustering performance. This approach comprises two key components: (1) Within-task clustering, which involves clustering the data for each task individually within its own input space, and (2) Cross-task clustering, where the shared subspace is learned simultaneously, and the data from all tasks are clustered together. This dual-faceted strategy optimizes the clustering results by combining individual task insights with cross-task synergies. 
Another notable example is in the context of point cloud tasks, where (hassani2019unsupervised) introduces an unsupervised multi-task model. This model is designed to concurrently learn point and shape features. It incorporates three unsupervised tasks: clustering, reconstruction, and self-supervised classification. These tasks are used to train a multi-scale graph-based encoder. In addition, (argyriou2006multi) introduces a method for learning a low-dimensional representation shared across multiple related tasks. This method extends the well-known 1-norm regularization problem by incorporating a novel regularizer that controls the number of features common to all tasks. The authors demonstrate that this approach can be formulated as a convex optimization problem and develop an iterative algorithm to solve it. The algorithm alternates between a supervised step and an unsupervised step: in the unsupervised step, it learns representations common across tasks, while in the supervised step, it uses these common representations to learn task-specific functions. This approach effectively combines supervised and unsupervised learning techniques to enhance multi-task learning.
3.5. Others
3.5.1. Applications of MTL
In the DL era, the advancement of multimodal analysis and MTL paradigms has brought challenges and also opened up exciting possibilities for MTL. In addition to the applications investigated in this paper, MTL plays an important role in many different fields such as visual assessment (yu2019towards; zhang2023blind), healthcare (zhang2023knowledge; zhao2023multi; zhang2023biomedgpt), transportation (wang2023multi; feng2023forecast), language models (liu2020multi; hu2021unit), and recommender systems (zhang2023advances; deng2023unified). Briefly, zhang2023blind develop a general and automated multitask learning scheme for blind image quality assessment. zeng2023new combine MTL algorithms with a deep belief network for the diagnosis of Alzheimer’s disease. Wang et al. (wang2023multi) propose a multi-task weakly supervised learning framework to infer transition probabilities between road segments. gao2023enhanced utilize relation-aware GCNs to fully capture multi-relation neighborhood features.
Despite the achievements in recent years, many outstanding MTL approaches still suffer from limitations that restrict their application to certain real-world scenarios. For example, it is difficult to capture the complex inter-scenario correlations with multiple tasks. Besides, in large-scale tasks, it remains a challenge to design scalable models and deal with the parameter explosion issue. Therefore, the scalability of MTL models is still a direction worth exploring (zhang2023advances).
3.5.2. MTL+X
MTL + Continual Learning. Biased forgetting of previous knowledge caused by new tasks remains challenging in continual learning. lyu2021multi propose Multi-Domain Multi-Task (MDMT) rehearsal to train the old tasks and new tasks together while keeping tasks from isolation. he2019task utilize meta-learning to achieve task-agnostic continual learning. MTL is a promising technique to mitigate catastrophic forgetting via learning task-relatedness.
Multi-Task Reinforcement Learning (MTRL). MTRL (vithayathil2020survey) holds promise in the context of Reinforcement Learning (RL), given the natural presence of diverse tasks like reach, push and pick in robotic manipulation. In the early stage, wilson2007multi approaches it as the solution to a sequence of Markov Decision Processes (MDPs) and employs a hierarchical Bayesian framework to infer the characteristics of new environments based on knowledge gained from previous environments. hessel2019multi introduce a method to automatically adjust the contribution of each task to the updates of a single agent. This ensures that all tasks exert a similar impact on the learning dynamics. taiga2022investigating investigates multi-task pretraining and generalization in RL. cheng2023multi propose an attention-based multi-task reinforcement learning approach to learn a compositional policy for each task.
4. Resources
In this section, we offer useful tools and resources that can help researchers and practitioners implement MTL models.
Dataset | Source | Year | Modality | Task | Synopsis | #Task | #Sample | Availability |
School Data | ILEA | mortimore1988school | Table | Regression | Predicting student exam scores based on 27 school features. | 139 | 15,362 | Official |
SARCOS Data | Humanoid Robotics | 2000 | Table | Regression | Estimate inverse dynamics model. | 7 | 44,484/4,449 | Official |
Computer Survey Data | Survey | lenk1996hierarchical | Table | Regression | Likelihood of purchasing personal computers. | 179 | - | - |
Climate Dataset | Sensor network | 2017-now | Table | Regression | Real-time climate data collected from four climate stations. | 7 | - | Official |
20 Newsgroups | Netnews articles | Lang95 | Text | Classification | Hierarchical text classification. | 20 | 19,000 | Official |
Reuters-21578 Collection | Reuters | 1996 | Text | Classification | Reuters news documents with hierarchical categories. | 90 | 21,578 | Official |
MultiMNIST Dataset | MNIST | sabour2017dynamic | Image | Classification | Classify the digits on the different positions. | 2 | - | Official |
ImageCLEF-2014 | Caltech, ImageNet, Pascal, Bing | 2014 | Image | Classification | Benchmark dataset for domain adaptation. | 4 | 2,400 | Official |
Office-Caltech Dataset | Office, Caltech | gong2012geodesic | Image | Classification | Benchmark dataset for the annotation and retrieval of images. | 4 | 2,533 | Official |
Office-31 Dataset | Amazon, DSLR, Webcam | saenko2010adapting | Image | Classification | Objects commonly encountered in office settings. | 3 | 4,110 | Official |
Office-Home Dataset | Office | venkateswara2017deep | Image | Classification | Object recognition and domain adaptation in the era of deep learning. | 4 | 15,588 | Official |
DomainNet Dataset | UDA | peng2019moment | Image | Classification | Multi-source unsupervised domain adaptation research | 6 | 600,000 | Official |
EMMa Dataset | Amazon | standley2023extensible | Image, Text | Classification | Amazon product listings for category prediction | - | 2,800,000 | Official |
SYNTHIA Dataset | European Union | ros2016synthia | Image | Classification | A synthetic dataset for semantic segmentation. | - | 13,400 | Official |
SVHN Dataset | Stanford | yang2021few | Image | Classification | A digit classification benchmark dataset. | - | 600,000 | Official |
CelebA Dataset | MMLAB | liu2018large | Image | Classification | A large-scale face attributes dataset. | 40 | 200,000 | Official |
CityScapes Dataset | Daimler AG | cordts2016cityscapes | Image | Dense prediction | Semantic urban scene understanding | - | 5,000 | Official |
NYU-Depth Dataset V2 | New York University | silberman2012indoor | Image | Dense prediction | Indoor scene understanding with per-pixel labels | 3 | 35,064 | Official |
PASCAL VOC Project | University of Oxford | everingham2010pascal | Image | Dense prediction | Object recognition with multiple tasks | - | - | Official |
Taskonomy Dataset | Stanford | zamir2018taskonomy | Image | Dense prediction | Diverse dataset with 26 tasks for task transfer learning | 26 | 4,000,000 | Official |
STREET | Amazon | ribeiro2023street | Text | Reasoning | The multi-task structured reasoning and explanation benchmark | - | - | - |
VKITTI2 Dataset | Naver | cabon2020virtual | Video | Segmentation | A video dataset which is automatically labeled with ground truth | 5 | - | Official |
XTREME | Carnegie Mellon | hu2020xtreme | Text | Translation, QA | A multilingual benchmark for evaluating cross-lingual generalisation | 9 | 400,000 | - |
Deepfashion Dataset | Shopping Websites | liu2016deepfashion | Image | Classification | A large-scale clothes dataset with comprehensive annotations | 2 | 800,000 | Official |
ACE05 Dataset | News | 2005 | Text | Classification | A large corpus with annotated entities, relations and events | 3 | 52,615 | Official |
ATIS Dataset | Airline | hemphill-etal-1990-atis | Text | Classification | A dataset with 17 unique intent categories. | 3 | 5,871 | Official |
4.1. Dataset
In this section, we introduce benchmark datasets for MTL from a taxonomic perspective. Specifically, since different datasets have spawned distinct families of data-driven models, we group the MTL datasets into three main categories (regression, classification, and dense prediction tasks) and collect the remaining benchmarks under Others.
4.1.1. Regression task
Synthetic Data. This dataset is often artificially defined by researchers, thus different from one another, e.g. caruana1997multitask; bakker2003task; evgeniou2004regularized; argyriou2008convex; jalali2010dirty; zhou2011clustered; titsias2011spike; zhang2012convex; maurer2013sparse; han2016multi; parra2017spectral; nie2018calibrated; ma2018modeling, to name a few. The features are often generated via drawing random variables from a shared distribution and adding irrelevant variants from other distributions, and the corresponding responses are produced by a specific computational method. In such a manner, data in different tasks would contain both the task-specific and -shared features that contribute to the learning for estimation.
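The generation recipe described above can be illustrated with a minimal sketch. Everything here is an assumption chosen for illustration (the function name `make_synthetic_mtl_regression`, Gaussian features, a linear response, and the noise levels); the cited works each use their own generators.

```python
import numpy as np

def make_synthetic_mtl_regression(n_tasks=4, n_samples=50, n_shared=5,
                                  n_specific=2, n_irrelevant=3,
                                  noise_std=0.1, seed=0):
    """Return a list of (X, y, w) triples, one per task (illustrative only)."""
    rng = np.random.default_rng(seed)
    d = n_shared + n_specific + n_irrelevant
    w_shared = rng.normal(size=n_shared)  # weights common to all tasks
    tasks = []
    for _ in range(n_tasks):
        # Relevant features come from a shared Gaussian distribution.
        X = rng.normal(size=(n_samples, d))
        # Irrelevant features are drawn from a different (uniform) distribution.
        X[:, n_shared + n_specific:] = rng.uniform(
            -1.0, 1.0, size=(n_samples, n_irrelevant))
        w = np.zeros(d)
        # Shared block: common weights plus a small task-specific perturbation.
        w[:n_shared] = w_shared + 0.1 * rng.normal(size=n_shared)
        # Task-specific block: weights drawn independently for each task.
        w[n_shared:n_shared + n_specific] = rng.normal(size=n_specific)
        # Irrelevant block keeps zero weight, so it never affects the response.
        y = X @ w + noise_std * rng.normal(size=n_samples)
        tasks.append((X, y, w))
    return tasks
```

In this sketch the tasks are related only through the shared weight block, which is the kind of structure that feature-sharing MTL methods are expected to recover.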
School Data. mortimore1988school comes from the Inner London Education Authority (ILEA) and contains 15,362 student examination records from 139 secondary schools, described by 27 student- and school-specific features. The goal is to predict exam scores from these features, and the prediction for each school is generally handled as one task.
SARCOS Data (2000, gaussianprocess.org/gpml/data). This dataset comes from humanoid robotics and consists of 44,484 training examples and 4,449 test examples. The goal of learning is to estimate the inverse dynamics model of a 7 degrees-of-freedom (DOF) SARCOS anthropomorphic robot arm. Each of the 7 joint torques corresponds to a task, and each example contains 21 features: 7 joint positions, 7 joint velocities, and 7 joint accelerations.
Computer Survey Data. lenk1996hierarchical is from a survey on the likelihood (11-point scale from 0 to 10) of purchasing personal computers. Computer models serve as examples, each of which is described by 13 computer descriptions (e.g., price, CPU speed, and screen size) and 6 subject-level covariates (e.g., gender, computer knowledge, and work experience) as features, with the ratings of subjects as targets, i.e., tasks.
Climate Dataset (2017-now, www.cambermet.co.uk). This real-time dataset is collected from a sensor network (e.g., anemometer, thermistor, and pressure transducer) of four climate stations (Cambermet, Chimet, Sotonmet, and Bramblemet) in the south of England, which can represent tasks as needed. The archived data are reported in 5-minute intervals and include climate signals such as wind speed, wave period, barometric pressure, and water temperature. Generally, air temperature is considered the dependent variable and the others the independent variables (parra2017spectral; zhao2019multiple).
4.1.2. Classification task
20 Newsgroups. Lang95 is a collection of approximately 19,000 netnews articles, organized into 20 hierarchical newsgroups by topic, with root categories (e.g., comp, rec, sci, and talk) and sub-categories (e.g., comp.graphics, sci.electronics, and talk.politics.guns). Users can design different combinations as multiple text classification tasks (he2011graphbased; tan2015transitive; zhang2018multi; mao2020adaptive; xiao2020efficient).
Reuters-21578 Collection (1996, www.daviddlewis.com/resources/testcollections/reuters21578/). This text collection contains 21,578 documents from the Reuters newswire dating back to 1987. These documents were assembled and indexed with more than 90 correlated categories under 5 top categories (i.e., exchanges, orgs, people, place, topic), each of which includes a variable number of sub-categories. Users can independently define related tasks by choosing different combinations of categories; zheng2020multi; xiao2021new provide more detailed descriptions.
CelebA Dataset. CelebFaces Attributes Dataset (CelebA) (liu2018large) is a large-scale face attributes dataset with more than 200K celebrity images covering large pose variations and background clutter. CelebA offers large diversity, large quantity, and rich annotations, including 10,177 identities, 202,599 face images, and, per image, 5 landmark locations and 40 binary attribute annotations. The dataset can be employed as the training and test sets for the following computer vision tasks: face attribute recognition, face recognition, face detection, landmark (or facial part) localization, and face editing & synthesis.
MultiMNIST Dataset. This dataset originated from validating a capsule system (sabour2017dynamic), but it is also an MTL version of the MNIST dataset (lecun1998gradient). By overlaying multiple images together, traditional digit classification is converted to an MTL problem, where classifying the digit in each position is considered a distinct task. sener2018multi contributes a standard construction for the research community.
ImageCLEF-2014 Dataset (2014, www.imageclef.org/2014/adaptation). This dataset is a benchmark for the domain adaptation challenge, which contains 2,400 images of 12 common categories selected from 4 domains: Caltech 256, ImageNet 2012, Pascal VOC 2012, and Bing. These 4 domains are commonly considered as different tasks in MTL.
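To make the MultiMNIST idea of overlaying digits concrete, the following is a minimal sketch. The canvas size, the top-left/bottom-right placement, and the pixel-wise maximum used for merging are assumptions made for illustration and are not claimed to reproduce the exact construction of sener2018multi; `overlay_digits` is a hypothetical helper name.

```python
import numpy as np

def overlay_digits(first_img, second_img, canvas=36):
    """Place one digit top-left and another bottom-right on a shared canvas.

    Overlapping pixels are merged with a pixel-wise maximum. The label of
    `first_img` becomes task 1 and the label of `second_img` becomes task 2.
    """
    h, w = first_img.shape
    out = np.zeros((canvas, canvas), dtype=first_img.dtype)
    out[:h, :w] = first_img                      # top-left digit
    region = out[canvas - h:, canvas - w:]       # bottom-right window
    out[canvas - h:, canvas - w:] = np.maximum(region, second_img)
    return out
```

Each composite image then carries two labels, one per position, so a single input feeds two classification heads.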
Office-Caltech Dataset. gong2012geodesic is a standard benchmark for domain adaptation in computer vision, consisting of 2,533 real-world images of 10 common categories from the Office dataset and the Caltech-256 dataset. The images come from 4 distinct domains/tasks: Amazon, DSLR, Webcam, and Caltech.
Office-31 Dataset. saenko2010adapting consists of 4,110 images from 31 object categories across 3 domains/tasks: Amazon, DSLR, and Webcam.
Office-Home Dataset. venkateswara2017deep is collected for object recognition to validate domain adaptation models in the era of DL. It includes 15,588 images of objects in office and home settings (e.g., alarm clock, chair, eraser, keyboard, and telephone) organized into 4 domains/tasks: Art (paintings, sketches and artistic depictions), Clipart (clipart images), Product (product images from www.amazon.com), and Real-World (real-world objects captured with a regular camera).
DomainNet Dataset. peng2019moment is annotated for multi-source unsupervised domain adaptation (UDA) research. It contains about 0.6 million images from 345 categories across 6 distinct domains, e.g., sketch, infograph, quickdraw, and real.
SYNTHIA Dataset. ros2016synthia is a synthetic dataset created to address the need for a large and diverse collection of images with pixel-level annotations for vision-based semantic segmentation in urban scenarios, particularly for autonomous driving applications. It consists of precise pixel-level semantic annotations for 13 classes, including sky, building, road, sidewalk, fence, vegetation, lane-marking, pole, car, traffic signs, pedestrians, cyclists, and miscellaneous objects.
SVHN Dataset. Street View House Numbers (SVHN) (yang2021few) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits (from 0 to 9) cropped from pictures of house number plates. The cropped images are centered on the digit of interest, but nearby digits and other distractors are kept in the image. SVHN has three sets: a training set, a testing set, and an extra set with 530,000 images that are less difficult and can be used to help the training process.
Deepfashion Dataset. DeepFashion (liu2016deepfashion) is a large-scale clothes dataset with comprehensive annotations. It contains over 800,000 images, which are richly annotated with massive attributes, clothing landmarks, and correspondence of images taken under different scenarios including store, street snapshot, and consumer.
ACE05 Dataset (2005, catalog.ldc.upenn.edu/LDC2006T06). The ACE 2005 Multilingual Training Corpus comprises the comprehensive collection of training data in English, Arabic, and Chinese for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus includes diverse data types that have been annotated for entities, relations, and events. The Linguistic Data Consortium (LDC), with support from the ACE Program and additional assistance from LDC, carried out the annotation of this dataset.
ATIS Dataset. The ATIS (Airline Travel Information Systems) dataset (hemphill-etal-1990-atis) comprises audio recordings along with corresponding manual transcripts of human interactions with automated airline travel inquiry systems. These interactions involve individuals seeking flight-related information. The dataset includes 17 distinct intent categories representing different user intents. In the original data split, the training set contains 4,478 intent-labeled reference utterances, the development set contains 500 utterances, and the test set contains 893 utterances.
4.1.3. Dense prediction task
CityScapes Dataset. cordts2016cityscapes consists of 5,000 images with high-quality annotations and 20,000 images with coarse annotations from 50 different cities, covering 19 classes for semantic urban scene understanding. Specifically, pixel-wise semantic segmentation, instance segmentation, and ground-truth inverse depth estimation are often used as three different tasks (kendall2018multi; liu2019end) in MTL.
NYU-Depth Dataset V2. silberman2012indoor comprises 1,449 images from 464 indoor scenes across 3 cities and contains 35,064 distinct objects of 894 different classes. The dense per-pixel labels of class, instance, and depth are used in many computer vision tasks, e.g., semantic segmentation, depth prediction, and surface normal estimation (eigen2015predicting).
PASCAL VOC Project (2005, host.robots.ox.ac.uk/pascal/VOC). This project (everingham2010pascal) provides standardized image datasets for object class recognition and ran challenges evaluating performance on object class recognition from 2005 to 2012, where VOC07 (host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html), VOC08 (host.robots.ox.ac.uk/pascal/VOC/voc2008/index.html), and VOC12 (host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) are commonly used for MTL research. The multiple tasks cover classification, detection (e.g., body part, saliency, semantic edge), segmentation, attribute prediction (farhadi2009describing), surface normals prediction (maninis2019attentive), etc. Many of the annotations were labeled or distilled in follow-up work (chen2014detect; maninis2019attentive).
Taskonomy Dataset. zamir2018taskonomy is currently the most diverse dataset for computer vision in MTL, consisting of 4 million samples from 3D scans of buildings. It provides a dictionary of 26 tasks (e.g., 2D, 2.5D, 3D, and semantic tasks) as a computational taxonomic map for task transfer learning. Accordingly, Tiny-Taskonomy (standley2020tasks), with 5 sampled dense prediction tasks (semantic segmentation, surface normal prediction, depth prediction, keypoint detection, and edge detection), is a commonly used benchmark in MTL.
4.1.4. Others
EMMa Dataset. EMMa Dataset (standley2023extensible) comprises more than 2.8 million objects from Amazon product listings, each annotated with images, listing text, mass, price, product ratings, and its position in Amazon’s product-category taxonomy. It includes a comprehensive taxonomy of 182 physical materials, and objects are annotated with one or more materials from this taxonomy. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, allowing for the addition of new tasks and object attributes at scale.
STREET. STREET (ribeiro2023street) is a multi-task benchmark for structured reasoning and explanations in NLP. It consists of five existing datasets (ARC, SCONE, GSM8K, AQUA-RAT, and AR-LSAT) and introduces a unified reasoning formulation with textual logical units and reasoning graphs. Evaluation metrics and empirical performance analysis using T5-large and GPT-3 models are provided, along with error explanations on a per-dataset basis.
VKITTI2 Dataset. Virtual KITTI (gaidon2016virtual) is a new video dataset, automatically labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth, and optical flow. Virtual KITTI 2 (cabon2020virtual) is a more photo-realistic and better-featured version of the original virtual KITTI dataset. It exploits recent improvements of the Unity game engine and provides new data such as stereo images or scene flow.
XTREME. The XTREME (Cross-lingual Transfer Evaluation of Multilingual Encoders) (hu2020xtreme) benchmark is a multi-task evaluation framework to assess the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. It highlights the performance disparity between models tested on English, which achieve human-level performance on numerous tasks, and cross-lingually transferred models, which exhibit a significant performance gap, particularly in syntactic and sentence retrieval tasks.
Library | Language | Supported Methods |
RMTL | R | Sparse structure learning (tibshirani1996regression), multi-task feature selection (obozinski2006multi), low rank MTL (ji2009accelerated; pong2010trace), graph-based regularised MTL (widmer2010leveraging), multi-task clustering (gu2009learning) |
MALSAR | Matlab | Sparse structure learning (tibshirani1996regression), regularized MTL (evgeniou2004regularized), multi-task feature selection (obozinski2006multi), dirty block-sparse model (jalali2010dirty), low rank MTL (ji2009accelerated; pong2010trace), convex ASO (chen2009convex), sparse & low rank MTL (chen2012learning), clustered MTL (zhou2011clustered), robust MTL (chen2011integrating), robust multi-task feature learning (gong2012robust), Temporal group Lasso (zhou2011multi), convex fused sparse group Lasso (zhou2012modeling), incomplete multi-source feature learning (yuan2012multi), multi-stage multi-task feature learning (gong2012multi), multi-task clustering (gu2009learning) |
LibMTL | Python | Cross-stitch (misra2016cross), GradNorm (chen2018gradnorm), Uncertainty Weighting (kendall2018multi), MGDA-MTL (sener2018multi), MMoE (ma2018modeling), MultiNet++ (chennupati2019multinet++), LTB (guo2020learning), MTAN & DWA (liu2019end), PCGrad (yu2020gradient), GradDrop (chen2020just), CGC & PLE (tang2020progressive), IMTL (liu2021towards), GradVac (wang2021gradient), CAGrad (liu2021conflictaverse), DSelect-k (hazimeh2021dselect), RLW & RGW (lin2022reasonable), Nash-MTL (navon2022multi) |
4.2. Software Resources
To provide playgrounds for researchers to fairly compare different state-of-the-art algorithms in a unified environment, open-source platforms for MTL have emerged. Herein we introduce three popular software resources that target different audiences in terms of implementation language, algorithm coverage, downstream task domains, and modularization focus.
Regularized Multi-Task Learning (RMTL) (cran.r-project.org/web/packages/RMTL/index.html). It is a relatively small yet practical R library for MTL, especially for biology-related tasks. It includes ten algorithms applicable to regression, classification, joint predictor selection, task clustering, low-rank learning, and the incorporation of biological networks.
Multi-tAsk Learning via StructurAl Regularization (MALSAR) (github.com/jiayuzhou/MALSAR). It is an MTL package implemented in Matlab. Compared to RMTL, it does not focus on a particular field yet includes more algorithms: MALSAR implements 14 models with 26 of their variations.
Library for Multi-Task Learning (LibMTL) (github.com/median-research-group/LibMTL). It is a comprehensive open-source Python library built on PyTorch for MTL. LibMTL covers 104 MTL models obtained by combining 8 architectures with 13 loss weighting strategies. Moreover, it guarantees unified and consistent evaluations among models on three computer vision datasets. Different from the above packages, LibMTL is well-modularized and supports customization of individual components such as loss weighting strategies or architectures.
4.3. Evaluation Metric
4.3.1. Single-task Metric
In this section, we will introduce some single-task metrics that can be used to evaluate the performance of individual tasks in a multi-task learning (MTL) setup.
Regression Task Metric
Root Mean Squared Error (RMSE): RMSE is a commonly used metric to measure the average prediction error in regression tasks. It calculates the square root of the average of squared differences between predicted and true values. RMSE gives higher weights to larger errors, making it sensitive to outliers. It is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$
where $y_i$ represents the true value, $\hat{y}_i$ denotes the predicted value, and $n$ stands for the total number of samples.
Mean Absolute Percentage Error (MAPE): MAPE is a metric used to evaluate the accuracy of predictions in percentage terms. It measures the average percentage difference between predicted and true values. This metric is commonly used in business forecasting tasks. It is calculated as:
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|.$$
Symmetric Mean Absolute Percentage Error (SMAPE): SMAPE is similar to MAPE but has the advantage of being symmetric, meaning it treats overestimations and underestimations equally. It calculates the average percentage difference between predicted and true values, normalized by the sum of their absolute values. It is calculated as:
$$\mathrm{SMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{\left(|y_i| + |\hat{y}_i|\right)/2}.$$
Coefficient of Determination (R-squared): $R^2$ is a statistical metric that represents the proportion of variance in the dependent variable (the target) that is predictable from the independent variable (the prediction). It indicates how well the predicted values fit the actual data. It is calculated as:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},$$
where $\bar{y}$ is the mean of the true values $y_i$.
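As a quick illustration of the four regression metrics above, the sketch below computes them with NumPy. The function name `regression_metrics` is hypothetical, and the SMAPE variant that divides by the mean of the absolute values is one of several conventions in use; the code also assumes that no true value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAPE (%), SMAPE (%) and R^2 for one regression task."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    # MAPE is undefined when a true value is zero (assumed not to occur here).
    mape = 100.0 * np.mean(np.abs(err / y_true))
    smape = 100.0 * np.mean(
        np.abs(err) / ((np.abs(y_true) + np.abs(y_pred)) / 2.0))
    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": float(rmse), "mape": float(mape),
            "smape": float(smape), "r2": float(r2)}
```

In an MTL setting, such a function would typically be called once per regression task before any multi-task aggregation.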
Classification Task Metric
Confusion Matrix: A confusion matrix is a table that allows visualization of the performance of a classification model. It presents the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions. The confusion matrix is usually represented as follows:
$$\begin{array}{c|cc} & \text{Predicted Positive} & \text{Predicted Negative} \\ \hline \text{Actual Positive} & \mathrm{TP} & \mathrm{FN} \\ \text{Actual Negative} & \mathrm{FP} & \mathrm{TN} \end{array} \tag{121}$$
Accuracy: Accuracy is one of the most straightforward classification metrics, representing the proportion of correctly classified instances over the total number of instances in the dataset. It is calculated as:
$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}.$$
Precision: Precision is a metric that measures the proportion of true positive predictions (correctly predicted positive instances) over the total number of positive predictions made by the model. It is calculated as:
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.$$
Recall (Sensitivity or True Positive Rate - TPR): Recall calculates the proportion of true positive predictions (correctly predicted positive instances) over the total number of actual positive instances in the dataset. It is calculated as:
$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is especially useful when there is an uneven class distribution. It is calculated as:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Specificity (True Negative Rate): Specificity measures the proportion of true negative predictions (correctly predicted negative instances) over the total number of actual negative instances in the dataset. It is calculated as:
$$\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}.$$
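The threshold-free classification metrics above all derive from the four confusion-matrix counts. The following minimal sketch assumes binary labels encoded as 0/1; the helper names and the convention of returning 0.0 when a denominator is zero are choices made here for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def _ratio(num, den):
    # Convention chosen for this sketch: an undefined ratio is reported as 0.0.
    return num / den if den > 0 else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, F1 and specificity from counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "specificity": _ratio(tn, tn + fp),
    }
```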
Precision-Recall Curve: The precision-recall curve is a graphical representation of the tradeoff between precision and recall for different classification thresholds. It plots the precision on the y-axis against the recall on the x-axis as the threshold varies.
Area Under the Receiver Operating Characteristic Curve (AUC-ROC): AUC-ROC is a metric that evaluates the performance of a binary classification model across various discrimination thresholds. It represents the area under the ROC curve, where ROC stands for the Receiver Operating Characteristic.
Formula: The AUC-ROC is typically computed using various threshold values to calculate the True Positive Rate (TPR) and False Positive Rate (FPR) at each threshold. The AUC-ROC is then obtained by plotting TPR against FPR and calculating the area under the curve.
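The threshold sweep described above can be sketched as follows. This is an illustrative implementation that uses every distinct score as a threshold and integrates with the trapezoidal rule; the function name `roc_auc` is hypothetical, and the code assumes both classes are present in `y_true`.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Area under the ROC curve via a sweep over all distinct score thresholds."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]
    # Lowering the threshold step by step moves the curve from (0, 0) to (1, 1).
    for th in np.unique(scores)[::-1]:
        pred_pos = scores >= th
        tpr.append(np.sum(pred_pos & (y_true == 1)) / n_pos)
        fpr.append(np.sum(pred_pos & (y_true == 0)) / n_neg)
    # Trapezoidal rule over consecutive (FPR, TPR) points.
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return float(area)
```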
Object Detection Task Metric
Bounding Box: In object detection, algorithms typically predict bounding boxes and class labels for objects in an image. A bounding box is represented by a set of four coordinates: $(x_1, y_1, x_2, y_2)$, which define the top-left and bottom-right corners of the box.
Intersection Over Union (IoU): The IoU measures the overlap between the predicted bounding box $B_p$ and the ground truth bounding box $B_{gt}$. It is defined as:
$$\mathrm{IoU}(B_p, B_{gt}) = \frac{\left|B_p \cap B_{gt}\right|}{\left|B_p \cup B_{gt}\right|}.$$
True Positive (TP), False Positive (FP), and False Negative (FN):
- A detection is considered a TP if the IoU with the ground truth exceeds a given threshold (typically $0.5$) and the class label matches.
- A detection is an FP if the IoU is below this threshold, or if there is no corresponding ground truth.
- An FN represents a ground truth box which had no detected box surpassing the IoU threshold.
Precision:
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.$$
Recall:
$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
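The IoU computation and the TP/FP/FN rules above can be sketched as follows. The greedy matching by descending confidence, the tuple layouts, and the function names are assumptions for illustration; benchmark toolkits differ in details such as tie handling and whether the threshold comparison is strict.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, iou_thr=0.5):
    """Greedy matching for one image.

    preds: list of (box, label, score); gts: list of (box, label).
    Returns (tp, fp, fn).
    """
    matched = set()
    tp = fp = 0
    # Higher-confidence detections get the first pick of ground-truth boxes.
    for box, label, _score in sorted(preds, key=lambda d: -d[2]):
        best_iou, best_j = 0.0, None
        for j, (gt_box, gt_label) in enumerate(gts):
            if j in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou > iou_thr:
            matched.add(best_j)   # each ground-truth box is matched at most once
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(matched)  # ground-truth boxes left without a detection
    return tp, fp, fn
```

Precision and recall then follow directly from the returned counts.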
Mean Average Precision (mAP): The mAP is a widely-used metric in object detection, averaging the precision values at different recall levels across all classes.
Precision-Recall Curve for Object Detection: This curve plots precision against recall as the detection confidence threshold varies, typically at a fixed IoU threshold, offering insights into a detection model’s performance.
Average Recall (AR): AR averages the recall values obtained at various IoU thresholds.
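For the mAP metric described above, a minimal sketch of per-class Average Precision is given below. It assumes detections have already been marked as TP or FP at a fixed IoU threshold and uses all-point interpolation of the precision-recall curve, which is one common convention; the function names are hypothetical.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP for one class from detection scores, TP flags and the ground-truth count."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / n_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad the curve and make precision monotonically non-increasing in recall.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP as the unweighted mean of the per-class AP values."""
    return float(np.mean(list(ap_per_class.values())))
```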
Image Segmentation Metrics
Pixel Accuracy: Pixel accuracy is a simple metric that measures the proportion of pixels that are correctly classified. For a given image or set of images, it is defined as the ratio of correctly classified pixels to the total number of pixels.
Boundary F1 Score (BF): The Boundary F1 Score evaluates the accuracy of the boundaries in a segmentation task. Given predicted boundaries $B_p$ and ground truth boundaries $B_{gt}$, the BF score is the F1 score (harmonic mean of precision and recall) calculated based on the detected boundary pixels.
Panoptic Quality (PQ): The Panoptic Quality metric combines segmentation (things and stuff) and detection (things only) into a single score. In its standard form, it is defined as:
$$\mathrm{PQ} = \frac{\sum_{(p, g) \in \mathit{TP}} \mathrm{IoU}(p, g)}{|\mathit{TP}| + \frac{1}{2}|\mathit{FP}| + \frac{1}{2}|\mathit{FN}|},$$
where $\mathit{TP}$ is the set of matched pairs of predicted region $p$ and ground truth region $g$, $|\mathit{TP}|$ is the number of matched regions, $|\mathit{FP}|$ is the number of false positive regions, and $|\mathit{FN}|$ is the number of false negative regions.
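A small sketch of pixel accuracy and Panoptic Quality follows. It assumes the standard PQ definition (sum of IoUs over matched regions divided by the number of matches plus half the unmatched predictions and half the unmatched ground truths) and assumes that region matching has been done beforehand; the function names are illustrative.

```python
import numpy as np

def pixel_accuracy(pred_mask, gt_mask):
    """Fraction of pixels whose predicted class equals the ground-truth class."""
    pred_mask = np.asarray(pred_mask)
    gt_mask = np.asarray(gt_mask)
    return float(np.mean(pred_mask == gt_mask))

def panoptic_quality(matched_ious, n_fp, n_fn):
    """PQ from the IoUs of matched regions and the unmatched region counts."""
    n_tp = len(matched_ious)
    denom = n_tp + 0.5 * n_fp + 0.5 * n_fn
    # With no regions at all the score is reported as 0.0 in this sketch.
    return float(sum(matched_ious) / denom) if denom > 0 else 0.0
```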
Image Generation Metrics
Peak Signal-to-Noise Ratio (PSNR): PSNR is a traditional quality metric used to measure the quality of a reconstructed image compared to an original image. Higher values of PSNR indicate better quality. It is defined as:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right),$$
where $\mathrm{MAX}_I$ is the maximum possible pixel value of the image (often $255$ for an 8-bit image), and $\mathrm{MSE}$ is the Mean Squared Error between the original and the reconstructed image.
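Structural Similarity Index Measure (SSIM): SSIM measures the structural similarity between two images. It provides a more perceptual-based assessment of image quality than PSNR. A value of 1 indicates the images are identical in terms of structural information.
$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x\mu_y + c_1\right)\left(2\sigma_{xy} + c_2\right)}{\left(\mu_x^2 + \mu_y^2 + c_1\right)\left(\sigma_x^2 + \sigma_y^2 + c_2\right)},$$
where $x$ and $y$ are two images, $\mu$ represents the mean, $\sigma^2$ represents the variance, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $c_1$ and $c_2$ are constants to avoid instability when the denominator is close to zero.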
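PSNR and SSIM can be sketched in a few lines of NumPy. Note the simplification: the SSIM below is computed once over the whole image, whereas practical implementations usually average it over local windows. The constants `k1=0.01` and `k2=0.03` are the commonly used defaults, and the function names are illustrative.

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    a = np.asarray(original, dtype=float)
    b = np.asarray(reconstructed, dtype=float)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def global_ssim(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM evaluated over the whole image as a single window (simplified)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)
```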
Inception Score (IS): The Inception Score is used to evaluate the quality and diversity of generated images in GANs. A higher IS indicates both better image quality and greater diversity. It is calculated using a pre-trained Inception model:
$$\mathrm{IS} = \exp\left(\mathbb{E}_{x}\left[D_{KL}\left(p(y \mid x) \,\|\, p(y)\right)\right]\right),$$
where $x$ is an image, $y$ is the label predicted by the Inception model, and $D_{KL}$ is the Kullback-Leibler divergence.
Fréchet Inception Distance (FID): FID measures the similarity between the generated images and real images. It computes the Fréchet distance between two Gaussians fitted to the feature representations of the Inception network for both sets of images. Lower FID scores indicate that the two sets of images are more similar, implying better generation quality. It is defined as:
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
Where $\mu_r, \Sigma_r$ are the mean and covariance of the feature representations for real images and $\mu_g, \Sigma_g$ are those for generated images.
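As a simplified illustration, the sketch below evaluates the Fréchet distance between two Gaussians under the assumption that both covariance matrices are diagonal, in which case the trace term reduces to a per-dimension sum. The general case requires a matrix square root of the covariance product, which is not shown here; the function name and inputs are illustrative.

```python
import math


def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Frechet distance between Gaussians with diagonal covariances.

    mu_*: per-dimension means, var_*: per-dimension variances.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal matrices, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) is a sum over dimensions.
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


# Hypothetical 2-dimensional feature statistics.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [4.0, 1.0]))
```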
Text Generation Metrics
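BLEU (Bilingual Evaluation Understudy): BLEU is a metric originally designed for machine translation but is also used in text generation. It measures how many n-grams in the generated text match the n-grams in the reference text(s). The score ranges between 0 and 1, with 1 being a perfect match. It is defined as:
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
Where $w_n$ are the weights for each n-gram (typically $w_n = 1/N$), $p_n$ is the precision of n-grams, $N$ is the maximum n-gram order, and $\mathrm{BP}$ is the brevity penalty, which penalizes generated texts that are shorter than the reference.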
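A minimal sentence-level sketch of BLEU is given below, assuming a single reference, uniform weights, clipped n-gram precision, and the standard brevity penalty. It applies no smoothing, so any n-gram order without a match yields a score of zero. The helper names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with a single reference and uniform weights."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


cand = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
print(bleu(cand, ref, max_n=2))
```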
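ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Primarily used for evaluating summary generation, ROUGE measures the overlap between the n-grams in the generated text and the reference text(s). The n-gram variant, ROUGE-N, is defined as:
$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{gram_n \in S} \mathrm{Count}_{match}(gram_n)}{\sum_{S \in \text{References}} \sum_{gram_n \in S} \mathrm{Count}(gram_n)}$$
Where $\mathrm{Count}_{match}(gram_n)$ is the number of matching n-grams between the generated text and the reference summary, and $\mathrm{Count}(gram_n)$ is the number of n-grams in the reference summary.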
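The recall-oriented nature of ROUGE can be seen in a short sketch of the n-gram variant, assuming a single tokenized reference summary: matched n-grams are counted with clipping and divided by the number of n-grams in the reference. The function name is illustrative.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall of a candidate against a single reference (token lists)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # An n-gram matches at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


cand = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
print(rouge_n(cand, ref, n=1), rouge_n(cand, ref, n=2))
```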
Perplexity: Used for evaluating language models, perplexity measures how well the probability distribution predicted by the model aligns with the true distribution of the words in the text. Lower perplexity values indicate better model performance. It is defined as:
$$\mathrm{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i)\right)$$
Where $N$ is the total number of words, and $p(w_i)$ is the model’s predicted probability for word $w_i$.
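Perplexity is the exponential of the average negative log-probability per word. The sketch below assumes the model's probability for each word in the text is already available as a list; the function name is illustrative.

```python
import math


def perplexity(token_probs):
    """Perplexity from the model's predicted probability of each word."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)


# A model that assigns probability 0.25 to every word is as uncertain
# as a uniform choice among four words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```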
Self-BLEU: A metric that evaluates the diversity of generated texts. It computes the BLEU score between each generated text and all other generated texts. Lower Self-BLEU scores indicate higher diversity.
Distinct-N: Measures the diversity of generated content by computing the ratio of unique n-grams to the total number of generated n-grams. Higher values of Distinct-N indicate greater diversity.
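Distinct-N reduces to a ratio of set size to list size over all generated n-grams. A minimal sketch, assuming the generated texts are already tokenized into lists (the function name is illustrative):

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across tokenized texts."""
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in texts
        for i in range(len(tokens) - n + 1)
    ]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical generated sample with heavy repetition.
sample = [["a", "b", "a", "b"]]
print(distinct_n(sample, n=1), distinct_n(sample, n=2))
```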
4.3.2. Multi-task Metric
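In this section, we denote by $M_{m,i}$ and $M_{b,i}$ the STL measurements of the MTL method $m$ and the STL baseline $b$ for the $i$-th task, respectively. $l_i = 1$ indicates that a lower value means better performance for the measurement $M_i$, and $l_i = 0$ indicates the opposite.
Delta (dong2015multi)
The performance of the MTL method can be simply defined as the difference in the STL measurement between the MTL method and the STL baseline:
$$\Delta_i = M_{m,i} - M_{b,i} \tag{122}$$
where $M_i$ was set to be BLEU-4 (papineni2002bleu) in dong2015multi.
MTL gain (tang2020progressive)
To evaluate the benefit of the MTL method over the STL baseline on the $i$-th task, MTL gain is computed as below:
$$\text{MTL gain}_i = \begin{cases} M_{m,i} - M_{b,i}, & l_i = 0 \\ M_{b,i} - M_{m,i}, & l_i = 1 \end{cases} \tag{123}$$
which is consistent with any positive or negative measurements (cf. Delta (dong2015multi)).
$\Delta_m$ (maninis2019attentive)
The performance of the MTL method can be quantified by calculating the average per-task drop with respect to the single-task baseline using STL measurements:
$$\Delta_m = \frac{1}{T}\sum_{i=1}^{T} (-1)^{l_i} \frac{M_{m,i} - M_{b,i}}{M_{b,i}} \tag{124}$$
where $T$ is the number of tasks.
$\Delta_p$ (lin2022reasonable)
Given that many single tasks can be measured by several metrics, e.g. semantic segmentation measured by mIoU and pixel accuracy, and following (maninis2019attentive), the average relative improvement of the MTL method over the baseline on each metric of each task can be formulated as the MTL performance measurement:
$$\Delta_p = 100\% \times \frac{1}{T}\sum_{i=1}^{T} \frac{1}{N_i}\sum_{j=1}^{N_i} (-1)^{l_{i,j}} \frac{M_{m,i,j} - M_{b,i,j}}{M_{b,i,j}} \tag{125}$$
where $N_i$ is the number of metrics used for the $i$-th task, $M_{b,i,j}$ denotes the $j$-th performance measurement of the baseline method, e.g. the STL or vanilla MTL method, for the $i$-th task, and $M_{m,i,j}$ and $l_{i,j}$ are defined analogously for the MTL method and the direction of the $j$-th metric.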
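The relative multi-task measurements can be sketched in a few lines, assuming the usual convention that each relative difference to the baseline is sign-flipped when a lower value is better, then averaged over tasks (and over metrics within a task for the multi-metric variant). The function names and the example numbers are hypothetical.

```python
def delta_m(mtl, baseline, lower_is_better):
    """Average signed relative difference of MTL vs. baseline, one metric per task."""
    terms = []
    for m, b, lower in zip(mtl, baseline, lower_is_better):
        sign = -1.0 if lower else 1.0  # flip sign when lower values are better
        terms.append(sign * (m - b) / b)
    return sum(terms) / len(terms)


def delta_p(mtl, baseline, lower_is_better):
    """Multi-metric variant in percent: average over metrics, then over tasks.

    Each argument is a list (tasks) of lists (metrics of that task).
    """
    per_task = [delta_m(m, b, l) for m, b, l in zip(mtl, baseline, lower_is_better)]
    return 100.0 * sum(per_task) / len(per_task)


# Hypothetical results: segmentation (mIoU, pixel accuracy; higher is better)
# and depth estimation (absolute error; lower is better).
print(delta_m([0.42, 0.57], [0.40, 0.60], [False, True]))
print(delta_p([[0.42, 0.66], [0.57]], [[0.40, 0.60], [0.60]], [[False, False], [True]]))
```

A positive value means the MTL method improves on the baseline on average, regardless of whether individual metrics are higher-is-better or lower-is-better.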
5. Discussion
In this section, we will discuss several key questions and explore future directions concerning the theories and applications of MTL.
Multi-Task Pretraining. While MTL has demonstrated remarkable success in real-world scenarios, understanding its underlying mechanisms becomes even more imperative in the era of PFMs. When scalable foundation models are pre-trained on data in the wild to exhibit modality- and task-agnostic characteristics (§ 2.3), an essential question arises: What proportions of different tasks in the pretraining phase yield the best task-generalizable performance?
Competitive or Collaborative? While many proposed MTL methods benefit each task under their specific settings, competitive tasks continue to exist in real-world scenarios. Distinguishing competitive tasks from collaborative ones without human priors before employing MTL remains a challenge. Task prior sharing (§ 2.1.5) and task clustering methods (§ 2.1.6) can play a crucial role here, as they help reveal task relations and do not conflict with other multi-task representation learning methods.
Blessed or Cursed by a Large Number of Tasks? While MTL with a small number of tasks has been shown to outperform STL, and MTL with a large number of tasks has been demonstrated to be learnable, the relationship between model behavior and the number of tasks raises intriguing questions. Adding a new task typically introduces both knowledge and noise to the existing tasks. If all tasks are trained equally (e.g., in LLMs), without any selective mechanism, what are the outcomes of the final learned model for each individual task?
MTL for Other Things. The pursuit of performance through MTL has been shown to have potential drawbacks in terms of fairness (§ 3.1), security and privacy (§ 3.2). However, MTL can also contribute to learning fairness or enhancing security and privacy for involved tasks by incorporating novel metrics. In certain situations, a favorable trade-off between these considerations may exist.
Illuminating the Unseen with MTL. To underscore the insights provided by MTL, consider an example in which MTL results advanced our understanding of a complex problem. In a medical imaging scenario, MTL was applied to simultaneously predict multiple health-related outcomes, such as disease progression, severity, and patient response to treatment. Unlike STL approaches, MTL unveiled dependencies and interactions between these outcomes, showing that certain imaging features played dual roles in influencing multiple health aspects. This holistic perspective allowed researchers to identify subtle correlations and patterns that were previously obscured by individual task-centric analyses. In this case, MTL not only improved predictive accuracy but also uncovered hidden structure within the data, providing a richer understanding of the medical conditions under investigation. This example illustrates how MTL can reveal relationships between tasks and enhance interpretability beyond the capabilities of traditional STL methods.
6. Conclusion
In this survey, we introduce MTL from coarse to fine and review methodologies covering the traditional ML, DL, and PFM eras. First, we present the background of MTL, covering its scope, formal definition, comparisons with other paradigms, and the motivations behind it. After that, we explore why MTL works well and provide reasons that explain its intrinsic mechanisms. We formalize and illustrate MTL in a framework and further expand the methodology overview based on this framework. Specifically, we summarize the sparse structure learning, feature learning, low-rank learning, and decomposition methods of the traditional learning era. We categorize MTL in DL into feature sharing, task balancing, and neural architecture search methods; recent task- and modality-agnostic foundation models are also discussed, as they can learn universal comprehensiveness across tasks with different data modalities.
To sum up, MTL methods in the traditional learning era prefer to "drop" distinctive (task-specific) features to seek consensus. For instance, the classical $\ell_{2,1}$ norm can realize grouped feature selection across tasks to exploit common features that are effective and efficient for joint performance enhancement. Another example is the family of low-rank learning methods, which explore common underlying representations by imposing low-dimensional structure on the essential factors, where a small set of factors is assumed to govern multiple tasks. However, when it comes to DL models, powerful computational resources make it possible to handle all the features from different tasks, and the hierarchical structure with multiple layers can learn feature interactions across tasks at various levels of abstraction. Accordingly, over the past decade MTL has been dominated by feature fusing and task-balancing techniques that introduce learnable parameters. These learnable parameters play a crucial role in cross-task communication and eavesdropping during joint training. However, the mechanisms behind these complicated interactions inside the networks remain poorly understood. More recently, unified foundation models have shown promising results for MTL in real-world scenarios, as data with versatile modalities can be trained simultaneously to learn universal and effective comprehensiveness.
Overall, we hope this paper provides the research community with an extensive review that offers a comprehensive understanding of the research advances, current and future challenges, and opportunities and prospects for MTL.
Disclosure Statement
The authors have no conflicts of interest to declare.
Acknowledgments
This paper is the result of a collaborative effort, with each author contributing significantly to various aspects:
- Xiaokang Liu contributed by writing and organizing the section on MTL via low-rank factorization (§ 2.1.3).
- Jin Huang was responsible for the figure and layout designs, ensuring visual clarity and coherence.
- Yishan Shen focused on developing the MTL through prior sharing, as outlined in § 2.1.5.
- Ke Zhang was instrumental in writing and structuring the Graph-based MTL section (§ 2.2.9).
- Rong Zhou authored the STL metrics section and played a key role in organizing parts of the datasets.
- Eashan Aahikarla delved deeply into the distribution shifts that occur in MTL (§ 3.3).
- Wenxuan Ye took charge of organizing the GitHub website for this project, facilitating broader access and collaboration.
- Yixin Liu was pivotal in developing the security and privacy section for the MTL framework, as detailed in § 3.2.
- Zhaoming Kong and Kai Zhang were actively involved in discussions about the scope and structure of this survey.
- Jun Yu initiated this project in 2021 and managed the contents not specifically mentioned above, providing overall leadership and direction.
- Prof. Moore, Prof. Davison, Prof. Namboodiri and Prof. Yin contributed significantly by offering feedback and suggestions during the paper’s development.
- Prof. Chen finalized the paper structure, edited different versions of the manuscript, and tailored the materials towards the audiences of the research community.
All authors above actively participated in the proofreading and discussion stages of this paper. We extend our sincere gratitude to all for their valuable contributions and collective effort in bringing this research to this final version.