Accelerating Multi-Block Constrained Optimization Through Learning to Optimize

Ling Liang  Cameron Austin
Department of Mathematics
University of Maryland
College Park
[email protected]
[email protected] &Haizhao Yang
Department of Mathematics
Department of Computer Science
University of Maryland
College Park
[email protected]
Equal contribution.Corresponding author.
Abstract

Learning to Optimize (L2O) approaches, including algorithm unrolling, plug-and-play methods, and hyperparameter learning, have garnered significant attention and have been successfully applied to the Alternating Direction Method of Multipliers (ADMM) and its variants. However, the natural extension of L2O to multi-block ADMM-type methods remains largely unexplored. Such an extension is critical, as multi-block methods leverage the separable structure of optimization problems, offering substantial reductions in per-iteration complexity. Given that classical multi-block ADMM does not guarantee convergence, the Majorized Proximal Augmented Lagrangian Method (MPALM), which shares a similar form with multi-block ADMM and ensures convergence, is more suitable in this setting. Despite its theoretical advantages, MPALM’s performance is highly sensitive to the choice of penalty parameters. To address this limitation, we propose a novel L2O approach that adaptively selects this hyperparameter using supervised learning. We demonstrate the versatility and effectiveness of our method by applying it to the Lasso problem and the optimal transport problem. Our numerical results show that the proposed framework outperforms popular alternatives. Given its applicability to generic linearly constrained composite optimization problems, this work opens the door to a wide range of potential real-world applications.

Keywords Learning to Optimize  \cdot Multi-block ADMM  \cdot LASSO Problem  \cdot Optimal Transport Problem

1 Introduction

Optimization is a fundamental and essential process in machine learning and data science [4]. It involves fine-tuning algorithms, models, or systems to make them perform at their best on complex tasks. The goal of optimization is to find the best solution from a set of possible choices (i.e., feasible set), often with the aim of minimizing or maximizing a particular function (i.e., loss function). In the context of machine learning, optimization often refers to the process of adjusting a model’s parameters to minimize the error or loss function. This is typically achieved through various classical algorithmic frameworks such as gradient descent (GD) [8], stochastic gradient descent (SGD) [50], and more advanced methods like Adam [31] and RMSprop [27], to mention just a few. Machine learning approaches can play significant roles in designing efficient optimization algorithms and greatly improving the performance of well-designed algorithms. This field of research is known as Learning to Optimize (L2O) which attracts growing attention [14]. In particular, given a class of optimization problems, machine learning approaches can be used to effectively predict the performance of different hyperparameters and configurations of an optimization algorithm, thereby guiding the search for optimal configurations and even automating the process of tuning [29, 20].

1.1 Optimization model

Following the research scheme of L2O, we consider leveraging machine learning techniques in solving the following class of generic multi-block convex composite optimization problems:

minyi𝕐i,i=1,,pfξ(y1,,yp)+g(y1)s.t.i=1p𝒜iyi=c,\min_{y_{i}\in\mathbb{Y}_{i},\;i=1,\dots,p}\;\;f_{\xi}(y_{1},\dots,y_{p})+g(y_% {1})\quad\mathrm{s.t.}\quad\sum_{i=1}^{p}\mathcal{A}^{*}_{i}y_{i}=c,roman_min start_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , … , italic_p end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) + italic_g ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) roman_s . roman_t . ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_c , (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ ))

where 𝕐isubscript𝕐𝑖\mathbb{Y}_{i}blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is finite dimensional Euclidean space for i=1,,p𝑖1𝑝i=1,\dots,pitalic_i = 1 , … , italic_p, g:𝕐1(,+]:𝑔subscript𝕐1g:\mathbb{Y}_{1}\to(-\infty,+\infty]italic_g : blackboard_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT → ( - ∞ , + ∞ ] is a closed proper convex (possibly nonsmooth) function, fξ:𝕐1××𝕐p(,+):subscript𝑓𝜉subscript𝕐1subscript𝕐𝑝f_{\xi}:\mathbb{Y}_{1}\times\dots\times\mathbb{Y}_{p}\to(-\infty,+\infty)italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT : blackboard_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × blackboard_Y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT → ( - ∞ , + ∞ ) is a continuously differentiable convex function that depends on a random variable ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ sampled from a fixed distribution 𝒫𝒫\mathcal{P}caligraphic_P, c𝕏𝑐𝕏c\in\mathbb{X}italic_c ∈ blackboard_X is a given vector in the finite dimensional Euclidean space 𝕏𝕏\mathbb{X}blackboard_X, and 𝒜i:𝕏𝕐i:subscript𝒜𝑖𝕏subscript𝕐𝑖\mathcal{A}_{i}:\mathbb{X}\to\mathbb{Y}_{i}caligraphic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : blackboard_X → blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a linear mapping for i=1,,p𝑖1𝑝i=1,\dots,pitalic_i = 1 , … , italic_p. For notational simplicity, we denote y:=(y1;;yp)T𝕐:=𝕐1××𝕐passign𝑦superscriptsubscript𝑦1subscript𝑦𝑝𝑇𝕐assignsubscript𝕐1subscript𝕐𝑝y:=(y_{1};\dots;y_{p})^{T}\in\mathbb{Y}:=\mathbb{Y}_{1}\times\dots\times% \mathbb{Y}_{p}italic_y := ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; … ; italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∈ blackboard_Y := blackboard_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × blackboard_Y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT. Moreover, we define the linear mapping 𝒜:𝕏𝕐:𝒜𝕏𝕐\mathcal{A}:\mathbb{X}\to\mathbb{Y}caligraphic_A : blackboard_X → blackboard_Y as 𝒜:=(𝒜1;;𝒜p)assign𝒜subscript𝒜1subscript𝒜𝑝\mathcal{A}:=(\mathcal{A}_{1};\dots;\mathcal{A}_{p})caligraphic_A := ( caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; … ; caligraphic_A start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ). Then, we see that i=1p𝒜iTyi=𝒜ysuperscriptsubscript𝑖1𝑝subscriptsuperscript𝒜𝑇𝑖subscript𝑦𝑖superscript𝒜𝑦\sum_{i=1}^{p}\mathcal{A}^{T}_{i}y_{i}=\mathcal{A}^{*}y∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT caligraphic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y, for all y𝕐𝑦𝕐y\in\mathbb{Y}italic_y ∈ blackboard_Y. An optimization problem of the form (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) is of great interest, mainly due to its excellent modeling power. Indeed, many optimization problems from practical applications in the fields of statistical and machine learning, engineering, and image and signal processing can be formulated as instances of (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )). These applications include composite convex quadratic conic programming [35, 37], penalized and constrained regression [30], compressed sensing and sparse coding [55, 13, 19, 34], matrix and tensor completion [7, 38], regularized optimal transport [48, 58], and consensus optimization and federated learning [5, 61]. Note that for these modern applications in the era of big data, the number of blocks p𝑝pitalic_p and the dimension of a 𝕐isubscript𝕐𝑖\mathbb{Y}_{i}blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT can be large. Moreover, in practice, one usually needs to solve a sequence of optimization problems of the same form (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) as the random variable ξ𝜉\xiitalic_ξ being sampled from the distribution 𝒫𝒫\mathcal{P}caligraphic_P. Hence, solving these optimization problems efficiently at scale is essential for practical considerations.

As a convex optimization problem, (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) can be solved via applying existing algorithms designed for convex optimization. High-order methods, including interior point methods (IPMs) [45] and augmented Lagrangian methods (ALMs) [26, 21, 52], are commonly adopted due to their fast convergence rates and reliability in computing highly accurate solutions. Typically, at each iteration of a high-order method, one needs to solve a linear system or a convex (composite) quadratic programming subproblem, leading to excessive computational time per iteration. This can make the algorithm not scalable for large-scale problems. To design scalable algorithmic frameworks, recent years have witnessed significant advancements in developing and analyzing first-order methods (FOMs) for solving (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )). Given the presence of a potentially nonsmooth regularization term and linear constraints, prevalent first-order methods in machine learning, such as GD, SGD, Adam, and RMSprop, cannot be directly applied to the interested model (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )).

One of the most preferred approaches is the two-block alternating direction method of multipliers (ADMM) [22], which is a robust and scalable first-order method used for solving large-scale linearly constrained convex optimization problems; see Appendix A for more details. Though convergent under mild conditions, the two-block ADMM views y=(y1,,yp)𝑦subscript𝑦1subscript𝑦𝑝y=(y_{1},\dots,y_{p})italic_y = ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) as a single block, ignoring potential separable structures which may be beneficial to explore. Moreover, solving the corresponding subproblem at each iteration involving the whole y𝑦yitalic_y can be time-consuming, leading to high iteration complexity of the two-block ADMM. This motivates the direct extension of the two-block ADMM to the multi-block ADMM which favors the separable structure of y𝑦yitalic_y in the objective and constraints; see Appendix B. At each iteration of the multi-block ADMM, p𝑝pitalic_p subproblems of smaller sizes are solved individually, following a Jacobi-type updating rule. This results in reduced per-iteration complexity. However, the convergence of the above direct extension is no longer guaranteed [11]. To resolve this issue, a multi-block ADMM-type method known as the Majorized Proximal Augmented Lagrangian Method (MPALM) was proposed recently by [12]. The key ingredient of MPALM is to replace the Jacobi-type updating rule with a symmetric Gauss-Seidel-type updating rule, ensuring elegant convergence properties, under mild conditions.

1.2 Our contributions

The practical performance of the MPALM can be sensitive to the value of the augmented Lagrangian penalty parameter. Hence, to get satisfactory numerical performance, one needs to adaptively adjust its value. However, the adjustment can be highly problem-dependent and requires domain expertise [33]. Given the assumption that ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ is sample from a fixed distribution 𝒫𝒫\mathcal{P}caligraphic_P, developing data-driven frameworks for selecting the parameter is of significant interest to yield excellent empirical performance and enhance the applicability and efficiency of ADMM-type algorithms for solving challenging optimization problems of the form (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )). This paper aims at integrating data-driven machine learning techniques into the MPALM for solving the class of optimization problems (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )). Specifically, the contributions of this paper are summarized as follows.

  • We propose a simple yet effective L2O framework for learning the hyperparameter of a majorized proximal augmented Lagrangian method, a convergent multi-block ADMM-type method for solving the challenging optimization problems in the form of (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )). Our work continues the research theme in L2O and enhances the applicability of machine learning techniques in designing efficient optimization algorithms for generic constrained optimization, which is not fully explored in the literature.

  • Numerically, we consider two important practical applications, the Lasso problem and the discrete optimal transport problem, and showcase that the proposed framework is highly effective via numerical experiments and comparisons with state-of-the-art approaches. Given the excellent modeling power of (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )), our work further motivates various potential real-world applications from other domains.

2 Related work

Learn to optimize (L2O). L2O is a modern and effective approach towards designing optimization algorithms that reach a new level of efficiency. A comprehensive survey of L2O can be found in [14]. Our work is closely related to different subjects of L2O, including the configuration and hyperparameter learning [29, 20], plug-and-play approaches [56], and algorithm unrolling [43]. Particularly, a series of works targeting unrolling algorithmic frameworks for the Lasso model [55] (which as well as its dual problem are special cases of (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ ))) has attracted increasing attention [24, 44, 47, 62, 1, 17, 25, 39]. The fundamental algorithmic framework in these works is the Iterative Shrinkage Thresholding Algorithm (ISTA) [3]. Despite being characterized by hyperparameter learning, our approach can also be viewed as the application of the algorithm unrolling to the MPALM. In this way, our work continues this line of research and provides an alternative L2O approach for efficiently solving challenging optimization problems from practical applications, including Lasso-type problems.

ADMM-type methods. Incorporating machine learning techniques with ADMM-type methods has drawn growing attention in recent years. The two-block ADMM was employed as the fundamental algorithmic framework in image denoising, in which the involved proximal operators are replaced with learned operators [56, 6, 10, 53]. Great empirical success and comprehensive theoretical guarantees have been achieved in these works, making the plug-and-play ADMM approach one of the most popular and effective algorithms for image science. Just like unrolling the ISTA, unrolling the two-block ADMM and its linearized variant (also known as the primal dual hybrid gradient method [9]) has also attracted much attention; see e.g., [54, 49, 2, 17, 15, 57] and references therein. The unrolled two-block ADMM-type methods have also been demonstrated to have excellent empirical performance. There are other machine learning techniques that are helpful for improving the practical performance of the two-block ADMM. For instance, [60] successfully applied a reinforcement learning approach for selecting the hyperparameters in the two-block ADMM for distributed optimal power flow. While fruitful results have been established for L2O with two-block ADMM-type methods, results on combining L2O with multi-block ADMM-type methods, including MPALM, remain limited. We demonstrate in the present paper that the MPALM-based L2O approach is also effective when compared with existing state-of-the-art L2O approaches.

Optimal transport (OT). Recent years have seen a blossoming of interest in developing efficient solution methods for OT, which have numerous important applications, due partly to the essential metric property of its optimal solution [18, 48]. To solve OT problems efficiently at scale, the most popular first-order method is perhaps the Sinkhorn’s algorithm [18]. Sinkhorn’s algorithm is a simple iterative method for finding optimal solutions of entropy-regularized OT problems. Thus, it only provides approximate solutions to the original OT problems. To get high quality approximations, one needs to choose a small entropy regularization parameter, which causes the convergence of the Sinkhorn’s algorithm to be extremely slow. Moreover, a small entropy regularization parameter may also result in numerical issues, making the Sinkhorn’s algorithm unstable. Hence, developing effective algorithmic frameworks for solving OT problems remains an active research direction [23, 41, 32, 40, 16, 28, 59, 58]. However, works focusing on L2O approaches for OT remain limited, to the best of our knowledge. Our work provides a feasible approach for solving OT problems reliably via combining L2O with MPALM.

3 The majorized proximal augmented Lagrangian method

The following notation will be used throughout this paper. Let 𝔼𝔼\mathbb{E}blackboard_E be a finite dimensional Euclidean space, the standard inner product is denoted as ,\left\langle\cdot,\cdot\right\rangle⟨ ⋅ , ⋅ ⟩ and the associated induced norm is denoted as delimited-∥∥\left\lVert\cdot\right\rVert∥ ⋅ ∥. Let S:𝔼𝔼:𝑆𝔼𝔼S:\mathbb{E}\to\mathbb{E}italic_S : blackboard_E → blackboard_E be a self-adjoint positive semidefinite linear operator, the weighted-norm associated with S𝑆Sitalic_S is denoted as S\|\cdot\|_{S}∥ ⋅ ∥ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT, i.e., for any x𝔼𝑥𝔼x\in\mathbb{E}italic_x ∈ blackboard_E, xS=x,Sxsubscriptdelimited-∥∥𝑥𝑆𝑥𝑆𝑥\left\lVert x\right\rVert_{S}=\sqrt{\left\langle x,Sx\right\rangle}∥ italic_x ∥ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT = square-root start_ARG ⟨ italic_x , italic_S italic_x ⟩ end_ARG. For a differential function f:𝔼:𝑓𝔼f:\mathbb{E}\to\mathbb{R}italic_f : blackboard_E → blackboard_R, its gradient is denoted as f𝑓\nabla f∇ italic_f. If f:𝔼{±}:𝑓𝔼plus-or-minusf:\mathbb{E}\to\mathbb{R}\cup\{\pm\infty\}italic_f : blackboard_E → blackboard_R ∪ { ± ∞ } is an extended-valued convex function, then the effective domain of f𝑓fitalic_f is denoted as dom(f):={x:f(x)<}assigndom𝑓conditional-set𝑥𝑓𝑥\mathrm{dom}(f):=\{x\;:\;f(x)<\infty\}roman_dom ( italic_f ) := { italic_x : italic_f ( italic_x ) < ∞ }. Let xdom(f)𝑥dom𝑓x\in\mathrm{dom}(f)italic_x ∈ roman_dom ( italic_f ), then the subgradient of the convex function f𝑓fitalic_f at x𝑥xitalic_x is denoted as f(x):={v:f(x)f(x)+v,xx,x}assign𝑓𝑥conditional-set𝑣𝑓superscript𝑥𝑓𝑥𝑣superscript𝑥𝑥for-allsuperscript𝑥\partial f(x):=\{v\;:\;f(x^{\prime})\geq f(x)+\left\langle v,x^{\prime}-x% \right\rangle,\;\forall x^{\prime}\}∂ italic_f ( italic_x ) := { italic_v : italic_f ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ≥ italic_f ( italic_x ) + ⟨ italic_v , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT - italic_x ⟩ , ∀ italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT }.

In this section, we provide a detailed introduction of the main algorithmic framework proposed in [12], namely the majorized proximal augmented Lagrangian method (MPALM), and explain how to cleverly choose the proximal term for the augmented Lagrangian function such that the resulting proximal ALM subproblem can be solved efficiently and analytically. The key idea of designing the proximal term is motivated by the Gauss-Seidel iterative method for solving symmetric positive definite linear systems.

Since the problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) is a constrained convex optimization problem, we see that the first-order optimality conditions (also called the Karush-Kuhn-Tucker conditions) [51] for problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) are given by

0g(y1)+fξ(y)+𝒜x,𝒜yc=0.formulae-sequence0𝑔subscript𝑦1subscript𝑓𝜉𝑦𝒜𝑥superscript𝒜𝑦𝑐00\in\partial g(y_{1})+\nabla f_{\xi}(y)+\mathcal{A}x,\quad\mathcal{A}^{*}y-c=0.0 ∈ ∂ italic_g ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + ∇ italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ) + caligraphic_A italic_x , caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c = 0 . (1)

For any point (x,y)𝕏×𝕐superscript𝑥superscript𝑦𝕏𝕐(x^{*},y^{*})\in\mathbb{X}\times\mathbb{Y}( italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ∈ blackboard_X × blackboard_Y that satisfies the above first-order optimality conditions, one can show that ysuperscript𝑦y^{*}italic_y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is a solution to problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) and xsuperscript𝑥x^{*}italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the corresponding Lagrange multiplier (also known as the dual optimal solution). For the rest of this paper, we assume that the following assumption holds.

Assumption 1.

The first-order optimality conditions (1) admit at least one solution (x,y)𝕏×𝕐superscript𝑥superscript𝑦𝕏𝕐(x^{*},y^{*})\in\mathbb{X}\times\mathbb{Y}( italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ∈ blackboard_X × blackboard_Y. Such a tuple is called a KKT solution.

Assumption 1 is commonly employed in the literature, as it presented one of the weakest conditions ensuring the solvability of the problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )). Assuming the solvability of the first-order optimality conditions (1), one may apply the classic augmented Lagrangian method for solving (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )). To this end, given a penalty parameter σ>0𝜎0\sigma>0italic_σ > 0, we define the augmented Lagrangian function as

σ(y;x):=fξ(y)+g(y1)+𝒜yc,x+σ2𝒜yc2,(y,x)𝕐×𝕏.formulae-sequenceassignsubscript𝜎𝑦𝑥subscript𝑓𝜉𝑦𝑔subscript𝑦1superscript𝒜𝑦𝑐𝑥𝜎2superscriptdelimited-∥∥superscript𝒜𝑦𝑐2for-all𝑦𝑥𝕐𝕏\mathcal{L}_{\sigma}(y;x):=f_{\xi}(y)+g(y_{1})+\left\langle\mathcal{A}^{*}y-c,% x\right\rangle+\frac{\sigma}{2}\left\lVert\mathcal{A}^{*}y-c\right\rVert^{2},% \quad\forall(y,x)\in\mathbb{Y}\times\mathbb{X}.caligraphic_L start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT ( italic_y ; italic_x ) := italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ) + italic_g ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + ⟨ caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c , italic_x ⟩ + divide start_ARG italic_σ end_ARG start_ARG 2 end_ARG ∥ caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , ∀ ( italic_y , italic_x ) ∈ blackboard_Y × blackboard_X .

Then given an initial point x0𝕏superscript𝑥0𝕏x^{0}\in\mathbb{X}italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ∈ blackboard_X and an increasing sequence of penalty parameter {σk}subscript𝜎𝑘\{\sigma_{k}\}{ italic_σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT }, the classical augmented Lagrangian iteratively performs the following updating scheme:

{yk+1:=argmin{σk(y;xk):y𝕐},xk+1:=xk+σk(𝒜yk+1c),casesassignsuperscript𝑦𝑘1argminconditional-setsubscriptsubscript𝜎𝑘𝑦superscript𝑥𝑘𝑦𝕐assignsuperscript𝑥𝑘1superscript𝑥𝑘subscript𝜎𝑘superscript𝒜superscript𝑦𝑘1𝑐\left\{\begin{array}[]{l}y^{k+1}:=\mathrm{argmin}\left\{\mathcal{L}_{\sigma_{k% }}(y;x^{k})\;:\;y\in\mathbb{Y}\right\},\\[5.0pt] x^{k+1}:=x^{k}+\sigma_{k}\left(\mathcal{A}^{*}y^{k+1}-c\right),\end{array}\right.{ start_ARRAY start_ROW start_CELL italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT := roman_argmin { caligraphic_L start_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_y ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) : italic_y ∈ blackboard_Y } , end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT := italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT - italic_c ) , end_CELL end_ROW end_ARRAY

where k0𝑘0k\geq 0italic_k ≥ 0 denotes the iteration counter. Though admitting excellent convergence properties [52], the classical augmented Lagrangian method faces certain challenges: (1) yk+1superscript𝑦𝑘1y^{k+1}italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT with high accuracy can be difficult to obtain since one needs to solve a (possibly nonsmooth) convex optimization problem. (2) The potential separable structure in y𝑦yitalic_y is not explicitly explored. We next show how to address these challenges via replacing the augmented Lagrangian function σsubscript𝜎\mathcal{L}_{\sigma}caligraphic_L start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT in the classical augmented Lagrangian method with a convex function (as the sum of a convex quadratic function and g𝑔gitalic_g) such that the resulting subproblems in updating y𝑦yitalic_y are much easier to solve via fully exploring the potential separable structure in y𝑦yitalic_y. To this end, we need to make the following mild assumption with respect to the function fξsubscript𝑓𝜉f_{\xi}italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT.

Assumption 2 (Majorization of fξsubscript𝑓𝜉f_{\xi}italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT).

There exists a fixed self-adjoint positive semidefinite linear operator Σ:𝕐𝕐:Σ𝕐𝕐\Sigma:\mathbb{Y}\to\mathbb{Y}roman_Σ : blackboard_Y → blackboard_Y such that, for any ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ,

fξ(y)qξ(y;y):=fξ(y)+fξ(y),yy+12yyΣ2,y,y𝕐.formulae-sequencesubscript𝑓𝜉𝑦subscript𝑞𝜉𝑦superscript𝑦assignsubscript𝑓𝜉superscript𝑦subscript𝑓𝜉superscript𝑦𝑦superscript𝑦12superscriptsubscriptdelimited-∥∥𝑦superscript𝑦Σ2for-all𝑦superscript𝑦𝕐f_{\xi}(y)\leq q_{\xi}(y;y^{\prime}):=f_{\xi}(y^{\prime})+\left\langle\nabla f% _{\xi}(y^{\prime}),y-y^{\prime}\right\rangle+\frac{1}{2}\left\lVert y-y^{% \prime}\right\rVert_{\Sigma}^{2},\quad\forall\;y,y^{\prime}\in\mathbb{Y}.italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ) ≤ italic_q start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ; italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) := italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) + ⟨ ∇ italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) , italic_y - italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ⟩ + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y - italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT roman_Σ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , ∀ italic_y , italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_Y .

For later usage, we shall partition ΣΣ\Sigmaroman_Σ into p×p𝑝𝑝p\times pitalic_p × italic_p sub-blocks as Σ=[Σij]1i,jpΣsubscriptdelimited-[]subscriptΣ𝑖𝑗formulae-sequence1𝑖𝑗𝑝\Sigma=[\Sigma_{ij}]_{1\leq i,j\leq p}roman_Σ = [ roman_Σ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT 1 ≤ italic_i , italic_j ≤ italic_p end_POSTSUBSCRIPT,

where Σij:𝕐j𝕐i, 1i,jp:subscriptΣ𝑖𝑗formulae-sequencesubscript𝕐𝑗subscript𝕐𝑖formulae-sequence1𝑖𝑗𝑝\Sigma_{ij}:\mathbb{Y}_{j}\to\mathbb{Y}_{i},\;1\leq i,j\leq proman_Σ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT : blackboard_Y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT → blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 1 ≤ italic_i , italic_j ≤ italic_p, are linear mappings. We can see that Assumption 2 is indeed quite mild. For example, if fξsubscript𝑓𝜉f_{\xi}italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT has a Lipschitz continuous gradient for every ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ with a uniformly bounded Lipschitz constant, then Assumption 2 holds. In the latter case, ΣΣ\Sigmaroman_Σ can be chosen as a diagonal matrix. Here, we allow non-diagonal ΣΣ\Sigmaroman_Σ in order to obtain a more accurate majorization of fξsubscript𝑓𝜉f_{\xi}italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT when defining the majorized augmented Lagrangian function 111For example, consider fξ(y)=12y,𝒬y+bξ,ysubscript𝑓𝜉𝑦12𝑦𝒬𝑦subscript𝑏𝜉𝑦f_{\xi}(y)=\frac{1}{2}\left\langle y,\mathcal{Q}y\right\rangle+\left\langle b_% {\xi},y\right\rangleitalic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ⟨ italic_y , caligraphic_Q italic_y ⟩ + ⟨ italic_b start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT , italic_y ⟩ where 𝒬0succeeds-or-equals𝒬0\mathcal{Q}\succeq 0caligraphic_Q ⪰ 0 does not depend on ξ𝜉\xiitalic_ξ. We see that fξsubscript𝑓𝜉f_{\xi}italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT satisfies Assumption 2. In this case, we can set Σ=λmax(𝒬)IΣsubscript𝜆max𝒬𝐼\Sigma=\lambda_{\textrm{max}}(\mathcal{Q})Iroman_Σ = italic_λ start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ( caligraphic_Q ) italic_I or Σ=𝒬Σ𝒬\Sigma=\mathcal{Q}roman_Σ = caligraphic_Q. Obviously, the latter choice leads to an exact majorization of fξsubscript𝑓𝜉f_{\xi}italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT..

Then, for a given ξ𝒫similar-to𝜉𝒫\xi\sim\mathcal{P}italic_ξ ∼ caligraphic_P and the augmented Lagrangian penalty parameter σ>0𝜎0\sigma>0italic_σ > 0, we can define the majorized augmented Lagrangian function (with slightly abuse of notation \mathcal{L}caligraphic_L)

ξ,σ(y;x,y):=qξ(y;y)+g(y1)+𝒜yc,x+σ2𝒜yc2,(y,x,y)𝕐×𝕏×𝕐.formulae-sequenceassignsubscript𝜉𝜎𝑦𝑥superscript𝑦subscript𝑞𝜉𝑦superscript𝑦𝑔subscript𝑦1superscript𝒜𝑦𝑐𝑥𝜎2superscriptdelimited-∥∥superscript𝒜𝑦𝑐2for-all𝑦𝑥superscript𝑦𝕐𝕏𝕐\mathcal{L}_{\xi,\sigma}(y;x,y^{\prime}):=q_{\xi}(y;y^{\prime})+g(y_{1})+\left% \langle\mathcal{A}^{*}y-c,x\right\rangle+\frac{\sigma}{2}\left\lVert\mathcal{A% }^{*}y-c\right\rVert^{2},\quad\forall\;(y,x,y^{\prime})\in\mathbb{Y}\times% \mathbb{X}\times\mathbb{Y}.caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y ; italic_x , italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) := italic_q start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ; italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) + italic_g ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + ⟨ caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c , italic_x ⟩ + divide start_ARG italic_σ end_ARG start_ARG 2 end_ARG ∥ caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , ∀ ( italic_y , italic_x , italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∈ blackboard_Y × blackboard_X × blackboard_Y .

Using the majorized augmented Lagrangian function, the main algorithmic framework, namely the majorized proximal augmented Lagrangian method (ALM), for solving problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) is presented in Algorithm 1.

Input: A fixed point ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ, an initial point (x0,y0)𝕏×𝕐superscript𝑥0superscript𝑦0𝕏𝕐(x^{0},y^{0})\in\mathbb{X}\times\mathbb{Y}( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) ∈ blackboard_X × blackboard_Y, a self-adjoint linear mapping 𝒮:𝕐𝕐:𝒮𝕐𝕐\mathcal{S}:\mathbb{Y}\to\mathbb{Y}caligraphic_S : blackboard_Y → blackboard_Y, the penalty parameter σ>0𝜎0\sigma>0italic_σ > 0, the step size τ(0,2)𝜏02\tau\in(0,2)italic_τ ∈ ( 0 , 2 ), and the maximum number of iterations K>0𝐾0K>0italic_K > 0.
1
2for k=0,,K1𝑘0𝐾1k=0,\dots,K-1italic_k = 0 , … , italic_K - 1 do
3       yk+1=argmin{ϕξ,k(y):=ξ,σ(y;xk,yk)+12yyk𝒮2:y𝕐}superscript𝑦𝑘1argminconditional-setassignsubscriptitalic-ϕ𝜉𝑘𝑦subscript𝜉𝜎𝑦superscript𝑥𝑘superscript𝑦𝑘12superscriptsubscriptdelimited-∥∥𝑦superscript𝑦𝑘𝒮2𝑦𝕐y^{k+1}=\mathrm{argmin}\left\{\phi_{\xi,k}(y):=\mathcal{L}_{\xi,\sigma}(y;x^{k% },y^{k})+\frac{1}{2}\left\lVert y-y^{k}\right\rVert_{\mathcal{S}}^{2}\;:\;y\in% \mathbb{Y}\right\}italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y ) := caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y - italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT : italic_y ∈ blackboard_Y }.
4      
5      xk+1=xk+τσ(𝒜yk+1c)superscript𝑥𝑘1superscript𝑥𝑘𝜏𝜎superscript𝒜superscript𝑦𝑘1𝑐x^{k+1}=x^{k}+\tau\sigma\left(\mathcal{A}^{*}y^{k+1}-c\right)italic_x start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_τ italic_σ ( caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT - italic_c ).
6      
7 end for
Output: xKsuperscript𝑥𝐾x^{K}italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT.
Algorithm 1 The majorized proximal ALM.

Here, we assume that the subproblem in Line 2 of Algorithm 1 can be solved exactly. We shall see later that the two applications considered in this paper satisfy this assumption. However, we have to emphasize that solving the subproblem inexactly is more realistic and important; see e.g., [12]. The convergence properties of Algorithm 1 are summarized as follows. Note that under additional conditions, including a certain error-bound condition, Algorithm 1 can further be shown to have a local Q-linear convergence rate [12]. Since these additional conditions are generally not easy to verify, we decide not to present the result here for simplicity. However, such a linear convergence is empirically observed based on numerical experience.

Theorem 1.

Suppose that Assumptions 1 and 2 hold and 𝒮:𝕐𝕐:𝒮𝕐𝕐\mathcal{S}:\mathbb{Y}\to\mathbb{Y}caligraphic_S : blackboard_Y → blackboard_Y is a given self-adjoint linear operator such that 12Σ+σ𝒜𝒜+𝒮0succeeds12Σ𝜎𝒜superscript𝒜𝒮0\frac{1}{2}\Sigma+\sigma\mathcal{A}\mathcal{A}^{*}+\mathcal{S}\succ 0divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_Σ + italic_σ caligraphic_A caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + caligraphic_S ≻ 0, and 𝒮12Σsucceeds-or-equals𝒮12Σ\mathcal{S}\succeq-\frac{1}{2}\Sigmacaligraphic_S ⪰ - divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_Σ. Let {(xk,yk)}superscript𝑥𝑘superscript𝑦𝑘\{(x^{k},y^{k})\}{ ( italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) } be the sequence generated by Algorithm 1. Then, the sequence is bounded and converges to a KKT solution.

Given the above convergence properties presented in Theorem 1, the next essential task is to choose the linear mapping 𝒮:𝕐𝕐:𝒮𝕐𝕐\mathcal{S}:\mathbb{Y}\to\mathbb{Y}caligraphic_S : blackboard_Y → blackboard_Y such that the subproblems can be solved efficiently via exploring the separable structure associated with y𝑦yitalic_y. [36] provides an elegant approach for choosing 𝒮𝒮\mathcal{S}caligraphic_S based on the idea of solving symmetric positive definite linear systems by Gauss-Seidel iterative method. Here, we shall briefly describe how to achieve this goal. To this end, we consider the mapping 𝒮𝒮\mathcal{S}caligraphic_S which can be decomposed as the sum of two self-adjoint mappings, i.e., 𝒮:=𝒮~+𝒮^assign𝒮~𝒮^𝒮\mathcal{S}:=\widetilde{\mathcal{S}}+\widehat{\mathcal{S}}caligraphic_S := over~ start_ARG caligraphic_S end_ARG + over^ start_ARG caligraphic_S end_ARG where 𝒮~:=Diag(𝒮~11,,𝒮~pp)assign~𝒮Diagsubscript~𝒮11subscript~𝒮𝑝𝑝\widetilde{\mathcal{S}}:=\mathrm{Diag}(\widetilde{\mathcal{S}}_{11},\;\dots,% \widetilde{\mathcal{S}}_{pp})over~ start_ARG caligraphic_S end_ARG := roman_Diag ( over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT , … , over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_p italic_p end_POSTSUBSCRIPT ) is block-diagonal with 𝒮~ii:𝕐i𝕐i:subscript~𝒮𝑖𝑖subscript𝕐𝑖subscript𝕐𝑖\widetilde{\mathcal{S}}_{ii}:\mathbb{Y}_{i}\to\mathbb{Y}_{i}over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT : blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT → blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, for i=1,,p𝑖1𝑝i=1,\dots,pitalic_i = 1 , … , italic_p, and 𝒮^^𝒮\widehat{\mathcal{S}}over^ start_ARG caligraphic_S end_ARG to be determined shortly. Note that the objective function for the ALM subproblem in Algorithm 1 can be simplified as

ϕξ,k(y)=12y,(σ𝒜𝒜y+𝒮+Σ)y+fξ(yk)+𝒜xkσ𝒜c𝒮ykΣyk,y+g(y1)+const,subscriptitalic-ϕ𝜉𝑘𝑦12𝑦𝜎𝒜superscript𝒜𝑦𝒮Σ𝑦subscript𝑓𝜉superscript𝑦𝑘𝒜superscript𝑥𝑘𝜎𝒜𝑐𝒮superscript𝑦𝑘Σsuperscript𝑦𝑘𝑦𝑔subscript𝑦1const\phi_{\xi,k}(y)=\frac{1}{2}\left\langle y,\left(\sigma\mathcal{A}\mathcal{A}^{% *}y+\mathcal{S}+\Sigma\right)y\right\rangle+\left\langle\nabla f_{\xi}(y^{k})+% \mathcal{A}x^{k}-\sigma\mathcal{A}c-\mathcal{S}y^{k}-\Sigma y^{k},y\right% \rangle+g(y_{1})+\textrm{const},italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ⟨ italic_y , ( italic_σ caligraphic_A caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y + caligraphic_S + roman_Σ ) italic_y ⟩ + ⟨ ∇ italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) + caligraphic_A italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT - italic_σ caligraphic_A italic_c - caligraphic_S italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT - roman_Σ italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y ⟩ + italic_g ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + const , (2)

where ”const” means a quantity that does not depend on y𝑦yitalic_y. For later usage, we write 𝒬:=σ𝒜𝒜y+𝒮~+Σ:=𝒰+𝒟+𝒰assign𝒬𝜎𝒜superscript𝒜𝑦~𝒮Σassign𝒰𝒟superscript𝒰\mathcal{Q}:=\sigma\mathcal{A}\mathcal{A}^{*}y+\widetilde{\mathcal{S}}+\Sigma:% =\mathcal{U}+\mathcal{D}+\mathcal{U}^{*}caligraphic_Q := italic_σ caligraphic_A caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y + over~ start_ARG caligraphic_S end_ARG + roman_Σ := caligraphic_U + caligraphic_D + caligraphic_U start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT,

where 𝒟:𝕐𝕐:𝒟𝕐𝕐\mathcal{D}:\mathbb{Y}\to\mathbb{Y}caligraphic_D : blackboard_Y → blackboard_Y and 𝒰:𝕐𝕐:𝒰𝕐𝕐\mathcal{U}:\mathbb{Y}\to\mathbb{Y}caligraphic_U : blackboard_Y → blackboard_Y are defined as

𝒟:=assign𝒟absent\displaystyle\mathcal{D}:=caligraphic_D := Diag(σ𝒜1𝒜1+Σ11+𝒮~11,,σ𝒜p𝒜p+Σpp+𝒮~pp),Diag𝜎subscript𝒜1superscriptsubscript𝒜1subscriptΣ11subscript~𝒮11𝜎subscript𝒜𝑝superscriptsubscript𝒜𝑝subscriptΣ𝑝𝑝subscript~𝒮𝑝𝑝\displaystyle\;\mathrm{Diag}\left(\sigma\mathcal{A}_{1}\mathcal{A}_{1}^{*}+% \Sigma_{11}+\widetilde{\mathcal{S}}_{11},\dots,\sigma\mathcal{A}_{p}\mathcal{A% }_{p}^{*}+\Sigma_{pp}+\widetilde{\mathcal{S}}_{pp}\right),roman_Diag ( italic_σ caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + roman_Σ start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT + over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT , … , italic_σ caligraphic_A start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + roman_Σ start_POSTSUBSCRIPT italic_p italic_p end_POSTSUBSCRIPT + over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_p italic_p end_POSTSUBSCRIPT ) ,
𝒰:=assign𝒰absent\displaystyle\mathcal{U}:=caligraphic_U := (0σ𝒜1𝒜2+Σ12σ𝒜1𝒜p1+Σ1,p1σ𝒜1𝒜p+Σ1p00σ𝒜2𝒜p1+Σ2,p1σ𝒜2𝒜p+Σ2p000σ𝒜p1𝒜p+Σp1,p0000).matrix0𝜎subscript𝒜1superscriptsubscript𝒜2subscriptΣ12𝜎subscript𝒜1superscriptsubscript𝒜𝑝1subscriptΣ1𝑝1𝜎subscript𝒜1superscriptsubscript𝒜𝑝subscriptΣ1𝑝00𝜎subscript𝒜2superscriptsubscript𝒜𝑝1subscriptΣ2𝑝1𝜎subscript𝒜2superscriptsubscript𝒜𝑝subscriptΣ2𝑝000𝜎subscript𝒜𝑝1superscriptsubscript𝒜𝑝subscriptΣ𝑝1𝑝0000\displaystyle\;\begin{pmatrix}0&\sigma\mathcal{A}_{1}\mathcal{A}_{2}^{*}+% \Sigma_{12}&\dots&\sigma\mathcal{A}_{1}\mathcal{A}_{p-1}^{*}+\Sigma_{1,p-1}&% \sigma\mathcal{A}_{1}\mathcal{A}_{p}^{*}+\Sigma_{1p}\\ 0&0&\dots&\sigma\mathcal{A}_{2}\mathcal{A}_{p-1}^{*}+\Sigma_{2,p-1}&\sigma% \mathcal{A}_{2}\mathcal{A}_{p}^{*}+\Sigma_{2p}\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&\cdots&0&\sigma\mathcal{A}_{p-1}\mathcal{A}_{p}^{*}+\Sigma_{p-1,p}\\ 0&0&\dots&0&0\end{pmatrix}.( start_ARG start_ROW start_CELL 0 end_CELL start_CELL italic_σ caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + roman_Σ start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_σ caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT italic_p - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + roman_Σ start_POSTSUBSCRIPT 1 , italic_p - 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_σ caligraphic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + roman_Σ start_POSTSUBSCRIPT 1 italic_p end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL … end_CELL start_CELL italic_σ caligraphic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT italic_p - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + roman_Σ start_POSTSUBSCRIPT 2 , italic_p - 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_σ caligraphic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + roman_Σ start_POSTSUBSCRIPT 2 italic_p end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL ⋯ end_CELL start_CELL 0 end_CELL start_CELL italic_σ caligraphic_A start_POSTSUBSCRIPT italic_p - 1 end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + roman_Σ start_POSTSUBSCRIPT italic_p - 1 , italic_p end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL … end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW end_ARG ) .

For the choice of 𝒮~~𝒮\widetilde{\mathcal{S}}over~ start_ARG caligraphic_S end_ARG, we only require that the following assumption holds.

Assumption 3 (Positive definiteness of 𝒬𝒬\mathcal{Q}caligraphic_Q).

𝒮~=Diag(𝒮~11,,𝒮~pp)~𝒮Diagsubscript~𝒮11subscript~𝒮𝑝𝑝\widetilde{\mathcal{S}}=\mathrm{Diag}(\widetilde{\mathcal{S}}_{11},\dots,% \widetilde{\mathcal{S}}_{pp})over~ start_ARG caligraphic_S end_ARG = roman_Diag ( over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT , … , over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_p italic_p end_POSTSUBSCRIPT ) is chosen appropriately such that

12Σii+σ𝒜i𝒜i+𝒮~ii0,i=1,,p,𝒮~12Σformulae-sequencesucceeds12subscriptΣ𝑖𝑖𝜎subscript𝒜𝑖superscriptsubscript𝒜𝑖subscript~𝒮𝑖𝑖0formulae-sequence𝑖1𝑝succeeds-or-equals~𝒮12Σ\frac{1}{2}\Sigma_{ii}+\sigma\mathcal{A}_{i}\mathcal{A}_{i}^{*}+\widetilde{% \mathcal{S}}_{ii}\succ 0,\;i=1,\dots,p,\quad\widetilde{\mathcal{S}}\succeq-% \frac{1}{2}\Sigmadivide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_Σ start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT + italic_σ caligraphic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT caligraphic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT ≻ 0 , italic_i = 1 , … , italic_p , over~ start_ARG caligraphic_S end_ARG ⪰ - divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_Σ

Under Assumption 3, 𝒟𝒟\mathcal{D}caligraphic_D is positive definite and hence nonsingular. Using the above decomposition, we propose to choose 𝒮^:=𝒰𝒟1𝒰assign^𝒮𝒰superscript𝒟1superscript𝒰\widehat{\mathcal{S}}:=\mathcal{U}\mathcal{D}^{-1}\mathcal{U}^{*}over^ start_ARG caligraphic_S end_ARG := caligraphic_U caligraphic_D start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT caligraphic_U start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, which is called the SGS-operator for 𝒬𝒬\mathcal{Q}caligraphic_Q, denoted by SGS(𝒬)SGS𝒬\mathrm{SGS}(\mathcal{Q})roman_SGS ( caligraphic_Q ). With this particular choice of 𝒮^^𝒮\widehat{\mathcal{S}}over^ start_ARG caligraphic_S end_ARG, we can show in the following theorem that one cycle of the block symmetric Gauss-Seidel update exactly solves the ALM subproblem in Line 2 of Algorithm 1.

Theorem 2 (Minimizing the ALM subproblem [36, Theorem 1]).

Under Assumption 3, the minimizer

yk+1=argmin{ϕξ,k(y):y𝕐}superscript𝑦𝑘1argminconditional-setsubscriptitalic-ϕ𝜉𝑘𝑦𝑦𝕐y^{k+1}=\mathrm{argmin}\left\{\phi_{\xi,k}(y)\;:\;y\in\mathbb{Y}\right\}italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y ) : italic_y ∈ blackboard_Y }

can be computed exactly as follows:

y~ik=superscriptsubscript~𝑦𝑖𝑘absent\displaystyle\tilde{y}_{i}^{k}=over~ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT = argmin{ξ,σ(y<ik,yi,y~>ik;xk,yk)+12yiyik𝒮~ii2:yi𝕐i},i=p,,2,formulae-sequenceargminconditional-setsubscript𝜉𝜎superscriptsubscript𝑦absent𝑖𝑘subscript𝑦𝑖subscriptsuperscript~𝑦𝑘absent𝑖superscript𝑥𝑘superscript𝑦𝑘12subscriptsuperscriptdelimited-∥∥subscript𝑦𝑖superscriptsubscript𝑦𝑖𝑘2subscript~𝒮𝑖𝑖subscript𝑦𝑖subscript𝕐𝑖𝑖𝑝2\displaystyle\;\mathrm{argmin}\left\{\mathcal{L}_{\xi,\sigma}(y_{<i}^{k},y_{i}% ,\tilde{y}^{k}_{>i};x^{k},y^{k})+\frac{1}{2}\left\lVert y_{i}-y_{i}^{k}\right% \rVert^{2}_{\widetilde{\mathcal{S}}_{ii}}\;:\;y_{i}\in\mathbb{Y}_{i}\right\},% \quad i=p,\dots,2,roman_argmin { caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT < italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over~ start_ARG italic_y end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT > italic_i end_POSTSUBSCRIPT ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT : italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } , italic_i = italic_p , … , 2 ,
yik+1=superscriptsubscript𝑦𝑖𝑘1absent\displaystyle{y}_{i}^{k+1}=italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = argmin{ξ,σ(y<ik+1,yi,y~>ik;xk,yk)+12yiyik𝒮~ii2:yi𝕐i},i=1,,p.formulae-sequenceargminconditional-setsubscript𝜉𝜎superscriptsubscript𝑦absent𝑖𝑘1subscript𝑦𝑖subscriptsuperscript~𝑦𝑘absent𝑖superscript𝑥𝑘superscript𝑦𝑘12subscriptsuperscriptdelimited-∥∥subscript𝑦𝑖superscriptsubscript𝑦𝑖𝑘2subscript~𝒮𝑖𝑖subscript𝑦𝑖subscript𝕐𝑖𝑖1𝑝\displaystyle\;\mathrm{argmin}\left\{\mathcal{L}_{\xi,\sigma}(y_{<i}^{k+1},y_{% i},\tilde{y}^{k}_{>i};x^{k},y^{k})+\frac{1}{2}\left\lVert y_{i}-y_{i}^{k}% \right\rVert^{2}_{\widetilde{\mathcal{S}}_{ii}}\;:\;y_{i}\in\mathbb{Y}_{i}% \right\},\quad i=1,\dots,p.roman_argmin { caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT < italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over~ start_ARG italic_y end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT > italic_i end_POSTSUBSCRIPT ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT : italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } , italic_i = 1 , … , italic_p .

Applying Theorem 2, one immediately obtains a symmetric Gauss-Seidel-based majorized proximal ALM (see Algorithm 2). Moreover, one can verify that that Assumption 3 ensures the convergence of the algorithm, as stated in Theorem 1. Consequently, we obtain an algorithm that takes a similar form of the multi-block ADMM with provable convergence. This is an appealing feature for practical applications since it allows one updating one block of the decision variables while keeping the remaining blocks fixed; see, e.g., the applications considered in Section 4. One can also observe that for i=2,p𝑖2𝑝i=2,\dots pitalic_i = 2 , … italic_p, solving the associated subproblems involves only solving linear systems, which can be done nearly exactly via elementary linear algebra routines. For i=1𝑖1i=1italic_i = 1, the computation of y1subscript𝑦1y_{1}italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is the proximal mapping of g𝑔gitalic_g. It is well-known that many important functions g𝑔gitalic_g admit analytical proximal mapping or can be approximate efficiently and accurately [46]. Thus, the associated ALM subproblem can also be solved in low computational costs.

4 Hyperparameter learning

Though Algorithm 1 has attractive convergence properties, its practical performance is sensitive to the choice of the penalty parameter σ𝜎\sigmaitalic_σ, based on our numerical experience. Practitioners typically adjust σ𝜎\sigmaitalic_σ dynamically in order to obtain better numerical performance [33]. From a high level point of view, the following algorithm with adaptive penalty parameters, i.e., Algorithm 2, is commonly adopted in practice.

Input: A fixed point ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ, an initial point (x0,y0)𝕏×𝕐superscript𝑥0superscript𝑦0𝕏𝕐(x^{0},y^{0})\in\mathbb{X}\times\mathbb{Y}( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) ∈ blackboard_X × blackboard_Y, a self-adjoint linear mapping 𝒮~:𝕐𝕐:~𝒮𝕐𝕐\widetilde{\mathcal{S}}:\mathbb{Y}\to\mathbb{Y}over~ start_ARG caligraphic_S end_ARG : blackboard_Y → blackboard_Y satisfies Assumption 3, the penalty parameter {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }, the step size τ(0,2)𝜏02\tau\in(0,2)italic_τ ∈ ( 0 , 2 ), the maximum number of iterations K>0𝐾0K>0italic_K > 0, and a positive integer K0Ksubscript𝐾0𝐾K_{0}\leq Kitalic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≤ italic_K.
1
2for k=0,,K1𝑘0𝐾1k=0,\dots,K-1italic_k = 0 , … , italic_K - 1 do
3      
4      Find j𝑗jitalic_j such that k[jK0,(j+1)K0)𝑘𝑗subscript𝐾0𝑗1subscript𝐾0k\in[jK_{0},(j+1)K_{0})italic_k ∈ [ italic_j italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , ( italic_j + 1 ) italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and set σ=σj𝜎subscript𝜎𝑗\sigma=\sigma_{j}italic_σ = italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.
5      
6      for i=p,,2𝑖𝑝2i=p,\dots,2italic_i = italic_p , … , 2 do
7             y~ik=argmin{ξ,σ(y<ik,yi,y~>ik;xk,yk)+12yiyik𝒮~ii2:yi𝕐i}superscriptsubscript~𝑦𝑖𝑘argminconditional-setsubscript𝜉𝜎superscriptsubscript𝑦absent𝑖𝑘subscript𝑦𝑖subscriptsuperscript~𝑦𝑘absent𝑖superscript𝑥𝑘superscript𝑦𝑘12subscriptsuperscriptdelimited-∥∥subscript𝑦𝑖superscriptsubscript𝑦𝑖𝑘2subscript~𝒮𝑖𝑖subscript𝑦𝑖subscript𝕐𝑖\tilde{y}_{i}^{k}=\mathrm{argmin}\left\{\mathcal{L}_{\xi,\sigma}(y_{<i}^{k},y_% {i},\tilde{y}^{k}_{>i};x^{k},y^{k})+\frac{1}{2}\left\lVert y_{i}-y_{i}^{k}% \right\rVert^{2}_{\widetilde{\mathcal{S}}_{ii}}\;:\;y_{i}\in\mathbb{Y}_{i}\right\}over~ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT = roman_argmin { caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT < italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over~ start_ARG italic_y end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT > italic_i end_POSTSUBSCRIPT ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT : italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }.
8            
9       end for
10      
11      for i=1,,p𝑖1𝑝i=1,\dots,pitalic_i = 1 , … , italic_p do
12             yik+1=argmin{ξ,σ(y<ik+1,yi,y~>ik;xk,yk)+12yiyik𝒮~ii2:yi𝕐i}superscriptsubscript𝑦𝑖𝑘1argminconditional-setsubscript𝜉𝜎superscriptsubscript𝑦absent𝑖𝑘1subscript𝑦𝑖subscriptsuperscript~𝑦𝑘absent𝑖superscript𝑥𝑘superscript𝑦𝑘12subscriptsuperscriptdelimited-∥∥subscript𝑦𝑖superscriptsubscript𝑦𝑖𝑘2subscript~𝒮𝑖𝑖subscript𝑦𝑖subscript𝕐𝑖{y}_{i}^{k+1}=\mathrm{argmin}\left\{\mathcal{L}_{\xi,\sigma}(y_{<i}^{k+1},y_{i% },\tilde{y}^{k}_{>i};x^{k},y^{k})+\frac{1}{2}\left\lVert y_{i}-y_{i}^{k}\right% \rVert^{2}_{\widetilde{\mathcal{S}}_{ii}}\;:\;y_{i}\in\mathbb{Y}_{i}\right\}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_argmin { caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT < italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over~ start_ARG italic_y end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT > italic_i end_POSTSUBSCRIPT ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over~ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT : italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }.
13            
14       end for
15      
16      xk+1=xk+τσ(𝒜yk+1c)superscript𝑥𝑘1superscript𝑥𝑘𝜏𝜎superscript𝒜superscript𝑦𝑘1𝑐x^{k+1}=x^{k}+\tau\sigma\left(\mathcal{A}^{*}y^{k+1}-c\right)italic_x start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_τ italic_σ ( caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT - italic_c ).
17      
18 end for
Output: xKsuperscript𝑥𝐾x^{K}italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT.
Algorithm 2 The symmetric Gauss-Seidel-based majorized proximal ALM with adaptive penalty parameters.

However, the criterion guiding the adjustment of the penalty parameters can be highly heuristic, which depends on the problems being solved and often requires advanced domain knowledge. To alleviate this difficulty, we propose to apply the data-driven supervised learning approach for learning the penalty parameters {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } such that the resulted algorithm performs empirically well when the random variable ξ𝜉\xiitalic_ξ is sampled from a fixed distribution 𝒫𝒫\mathcal{P}caligraphic_P. In particular, for a fixed initial point (x0,y0)𝕏×𝕐superscript𝑥0superscript𝑦0𝕏𝕐(x^{0},y^{0})\in\mathbb{X}\times\mathbb{Y}( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) ∈ blackboard_X × blackboard_Y and a fixed step size τ(0,2)𝜏02\tau\in(0,2)italic_τ ∈ ( 0 , 2 ), we see that the output of Algorithm 2 depends only on the random variable ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ and the penalty parameters {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }. To emphasize this dependency, we shall denote the output of Algorithm 2 as xK(ξ,{σj})superscript𝑥𝐾𝜉subscript𝜎𝑗x^{K}(\xi,\{\sigma_{j}\})italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_ξ , { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } ), with slightly abuse of notation. Similarly, we denote x(ξ)superscript𝑥𝜉x^{*}(\xi)italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_ξ ) to be the optimal solution of problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) depending on ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ. In order to find a good strategy of selecting the values of the penalty parameters {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }, we consider solving the following empirical risk minimization problem:

min{σj}1Ni=1NxK(ξ(j),{σj})x(ξ(j))2,ξ(1),,ξ(N)𝒫.similar-tosubscriptsubscript𝜎𝑗1𝑁superscriptsubscript𝑖1𝑁superscriptdelimited-∥∥superscript𝑥𝐾superscript𝜉𝑗subscript𝜎𝑗superscript𝑥superscript𝜉𝑗2superscript𝜉1superscript𝜉𝑁𝒫\min_{\{\sigma_{j}\}}\;\frac{1}{N}\sum_{i=1}^{N}\left\lVert x^{K}(\xi^{(j)},\{% \sigma_{j}\})-x^{*}(\xi^{(j)})\right\rVert^{2},\quad\xi^{(1)},\dots,\xi^{(N)}% \sim\mathcal{P}.roman_min start_POSTSUBSCRIPT { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∥ italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_ξ start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT , { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } ) - italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_ξ start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_ξ start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_ξ start_POSTSUPERSCRIPT ( italic_N ) end_POSTSUPERSCRIPT ∼ caligraphic_P . (ERM)

It is clear that the number of elements in the set {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } is at most J:=KK0+1assign𝐽𝐾subscript𝐾01J:=\left\lfloor\frac{K}{K_{0}}\right\rfloor+1italic_J := ⌊ divide start_ARG italic_K end_ARG start_ARG italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ⌋ + 1. Thus, the (ERM) is usually a small-scale optimization problem, which can be solved efficiently by existing algorithms, such as SGD and ADAM. We note here that these commonly used optimizers rely on the back-propagation to compute the stochastic gradient estimators of the objective functions in (ERM). This implicitly requires that the computations in Algorithm 2 do not break the computational tree in order to keep track of the gradient information. If the back-propagation fails in practice, stochastic gradient based optimizers are no longer applicable. In this case, we may rely on grid search to find good penalty parameters, though it can be costly.

For the rest of this section, we consider applying the proposed hyperparameter learning approach for solving two class of important problems that attract growing attention in recent years. This first problem is the Lasso problem, which plays an important role in compressed sensing and sparse coding. And the second problem is the discrete optimal transport problem, which has many applications in modern machine learning. Solving them efficiently has been and will remain an active research direction in the literature.

4.1 Application to classical Lasso problems

Recall that the classical Lasso problem [55] takes the following form:

minwn12Dwξ2+μw1,subscript𝑤superscript𝑛12superscriptdelimited-∥∥𝐷𝑤𝜉2𝜇subscriptdelimited-∥∥𝑤1\min_{w\in\mathbb{R}^{n}}\;\;\frac{1}{2}\left\lVert Dw-\xi\right\rVert^{2}+\mu% \left\lVert w\right\rVert_{1},roman_min start_POSTSUBSCRIPT italic_w ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_D italic_w - italic_ξ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_μ ∥ italic_w ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , (Lasso(ξ)Lasso𝜉\textrm{Lasso}(\xi)Lasso ( italic_ξ ))

where Dm×n𝐷superscript𝑚𝑛D\in\mathbb{R}^{m\times n}italic_D ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT is the given dictionary, ξm𝜉superscript𝑚\xi\in\mathbb{R}^{m}italic_ξ ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is a random variable sample from a fixed distribution 𝒫𝒫\mathcal{P}caligraphic_P, and μ>0𝜇0\mu>0italic_μ > 0 denotes the regularization parameter. Though problem (Lasso(ξ)Lasso𝜉\textrm{Lasso}(\xi)Lasso ( italic_ξ )) is a special case of (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) with fξ(w):=12Dwξ2assignsubscript𝑓𝜉𝑤12superscriptdelimited-∥∥𝐷𝑤𝜉2f_{\xi}(w):=\frac{1}{2}\left\lVert Dw-\xi\right\rVert^{2}italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_w ) := divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_D italic_w - italic_ξ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and g(w):=μx1assign𝑔𝑤𝜇subscriptdelimited-∥∥𝑥1g(w):=\mu\left\lVert x\right\rVert_{1}italic_g ( italic_w ) := italic_μ ∥ italic_x ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, it is not desirable to apply Algorithm 1 for solving problem (Lasso(ξ)Lasso𝜉\textrm{Lasso}(\xi)Lasso ( italic_ξ )) directly. The reason is explained as follows. First, since fξsubscript𝑓𝜉f_{\xi}italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT is already a quadratic function, we can set Σ=DTDΣsuperscript𝐷𝑇𝐷\Sigma=D^{T}Droman_Σ = italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_D and get

qξ(w;w):=f(w)+fξ(w),ww+12wwDTD2=fξ(w),(w,w)n×n.formulae-sequenceassignsubscript𝑞𝜉𝑤superscript𝑤𝑓superscript𝑤subscript𝑓𝜉superscript𝑤𝑤superscript𝑤12superscriptsubscriptdelimited-∥∥𝑤superscript𝑤superscript𝐷𝑇𝐷2subscript𝑓𝜉𝑤for-all𝑤superscript𝑤superscript𝑛superscript𝑛q_{\xi}(w;w^{\prime}):=f(w^{\prime})+\left\langle\nabla f_{\xi}(w^{\prime}),w-% w^{\prime}\right\rangle+\frac{1}{2}\left\lVert w-w^{\prime}\right\rVert_{D^{T}% D}^{2}=f_{\xi}(w),\quad\forall(w,w^{\prime})\in\mathbb{R}^{n}\times\mathbb{R}^% {n}.italic_q start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_w ; italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) := italic_f ( italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) + ⟨ ∇ italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) , italic_w - italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ⟩ + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_w - italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_w ) , ∀ ( italic_w , italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT .

Then, the objective function for the ALM subproblem in Line 2 of Algorithm 1 can be written as

ϕξ,k(w):=12Dwξ2+μw1+12wwk𝒮2,assignsubscriptitalic-ϕ𝜉𝑘𝑤12superscriptdelimited-∥∥𝐷𝑤𝜉2𝜇subscriptdelimited-∥∥𝑤112superscriptsubscriptdelimited-∥∥𝑤superscript𝑤𝑘𝒮2\phi_{\xi,k}(w):=\frac{1}{2}\left\lVert Dw-\xi\right\rVert^{2}+\mu\left\lVert w% \right\rVert_{1}+\frac{1}{2}\left\lVert w-w^{k}\right\rVert_{\mathcal{S}}^{2},italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_w ) := divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_D italic_w - italic_ξ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_μ ∥ italic_w ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_w - italic_w start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ,

for a given 𝒮𝕊n𝒮superscript𝕊𝑛\mathcal{S}\in\mathbb{S}^{n}caligraphic_S ∈ blackboard_S start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. To ensure the convergence, the linear mapping 𝒮𝒮\mathcal{S}caligraphic_S must satisfy that 𝒮+12DTD0succeeds𝒮12superscript𝐷𝑇𝐷0\mathcal{S}+\frac{1}{2}D^{T}D\succ 0caligraphic_S + divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_D ≻ 0. If one chooses 𝒮=αIn𝒮𝛼subscript𝐼𝑛\mathcal{S}=\alpha I_{n}caligraphic_S = italic_α italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT where α>0𝛼0\alpha>0italic_α > 0 and Insubscript𝐼𝑛I_{n}italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT denotes the identity matrix of size n𝑛nitalic_n, we see that the resulted algorithm is the same as the proximal point algorithm, and the ALM subproblem can still be challenging to solve since one needs to rely on a certain iterative scheme (see e.g., [34] for a more comprehensive study of this approach). On the other hand, if one chooses 𝒮=αIn12DTD𝒮𝛼subscript𝐼𝑛12superscript𝐷𝑇𝐷\mathcal{S}=\alpha I_{n}-\frac{1}{2}D^{T}Dcaligraphic_S = italic_α italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_D with α>12λmax(DTD)𝛼12subscript𝜆maxsuperscript𝐷𝑇𝐷\alpha>\frac{1}{2}\lambda_{\textrm{max}}(D^{T}D)italic_α > divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_λ start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ( italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_D ), then solving the ALM subproblem is reduced to computing the proximal mapping of the function g(w)=μw1𝑔𝑤𝜇subscriptdelimited-∥∥𝑤1g(w)=\mu\left\lVert w\right\rVert_{1}italic_g ( italic_w ) = italic_μ ∥ italic_w ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, which admits analytical expression. In this case the algorithm coincides with the famous ISTA [3] which has been extensively studied in the literature. However, this approach requires evaluating λmax(DTD)subscript𝜆maxsuperscript𝐷𝑇𝐷\lambda_{\textrm{max}}(D^{T}D)italic_λ start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ( italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_D ), which could be costly. Moreover, if α𝛼\alphaitalic_α is chosen to be large, the convergence of the resulted algorithm can be quite slow, based on the empirical experience on the practical performance of the ISTA.

Motivated by the above arguments, we propose to solve (Lasso(ξ)Lasso𝜉\textrm{Lasso}(\xi)Lasso ( italic_ξ )) via solving its dual problem.

Lemma 1 (Dual problem of (Lasso(ξ)Lasso𝜉\textrm{Lasso}(\xi)Lasso ( italic_ξ ))).

The dual problem of (Lasso(ξ)Lasso𝜉\textrm{Lasso}(\xi)Lasso ( italic_ξ )) is equivalent to the following minimization problem:

miny1n,y2mδ𝔹μ(y1)+12y22ξ,y2s.t.y1+DTy2=0.\min_{y_{1}\in\mathbb{R}^{n},\;y_{2}\in\mathbb{R}^{m}}\;\delta_{\mathbb{B}_{% \mu}}(-y_{1})+\frac{1}{2}\left\lVert y_{2}\right\rVert^{2}-\left\langle\xi,y_{% 2}\right\rangle\quad\mathrm{s.t.}\quad y_{1}+D^{T}y_{2}=0.roman_min start_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT blackboard_B start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( - italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - ⟨ italic_ξ , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⟩ roman_s . roman_t . italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0 . (DLasso(ξ)DLasso𝜉\textrm{DLasso}(\xi)DLasso ( italic_ξ ))

where 𝔹μ:={xn:xμ}assignsubscript𝔹𝜇conditional-set𝑥superscript𝑛subscriptdelimited-∥∥𝑥𝜇\mathbb{B}_{\mu}:=\left\{x\in\mathbb{R}^{n}\;:\;\left\lVert x\right\rVert_{% \infty}\leq\mu\right\}blackboard_B start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT := { italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT : ∥ italic_x ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ≤ italic_μ } denotes the \infty-norm ball in nsuperscript𝑛\mathbb{R}^{n}blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT with radius μ>0𝜇0\mu>0italic_μ > 0.

It is readily checked that the problem (DLasso(ξ)DLasso𝜉\textrm{DLasso}(\xi)DLasso ( italic_ξ )) is a special case of the general model (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )), and the objective function for the ALM subproblem can be written as

ϕξ,k(y):=δ𝔹μ(y1)+12y,(000Im)y+(xkDxkξ),y+σ2(IDT)y2+12yyk𝒮2,assignsubscriptitalic-ϕ𝜉𝑘𝑦subscript𝛿subscript𝔹𝜇subscript𝑦112𝑦matrix000subscript𝐼𝑚𝑦matrixsuperscript𝑥𝑘𝐷superscript𝑥𝑘𝜉𝑦𝜎2superscriptdelimited-∥∥matrix𝐼superscript𝐷𝑇𝑦212superscriptsubscriptdelimited-∥∥𝑦superscript𝑦𝑘𝒮2\phi_{\xi,k}(y):=\delta_{\mathbb{B}_{\mu}}(-y_{1})+\frac{1}{2}\left\langle y,% \begin{pmatrix}0&0\\ 0&I_{m}\end{pmatrix}y\right\rangle+\left\langle\begin{pmatrix}x^{k}\\ Dx^{k}-\xi\end{pmatrix},y\right\rangle+\frac{\sigma}{2}\left\lVert\begin{% pmatrix}I&D^{T}\end{pmatrix}y\right\rVert^{2}+\frac{1}{2}\left\lVert y-y^{k}% \right\rVert_{\mathcal{S}}^{2},italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y ) := italic_δ start_POSTSUBSCRIPT blackboard_B start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( - italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ⟨ italic_y , ( start_ARG start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) italic_y ⟩ + ⟨ ( start_ARG start_ROW start_CELL italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_D italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT - italic_ξ end_CELL end_ROW end_ARG ) , italic_y ⟩ + divide start_ARG italic_σ end_ARG start_ARG 2 end_ARG ∥ ( start_ARG start_ROW start_CELL italic_I end_CELL start_CELL italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ) italic_y ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y - italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ,

where 𝒮𝒮\mathcal{S}caligraphic_S is the SGS-operator of the associated 𝒬𝒬\mathcal{Q}caligraphic_Q.

Then, we see that yk+1=argmin{ϕξ,k(y):yn×m}superscript𝑦𝑘1argminconditional-setsubscriptitalic-ϕ𝜉𝑘𝑦𝑦superscript𝑛superscript𝑚y^{k+1}=\mathrm{argmin}\{\phi_{\xi,k}(y)\;:\;y\in\mathbb{R}^{n}\times\mathbb{R% }^{m}\}italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y ) : italic_y ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT } can be computed as

y2k+1/2=superscriptsubscript𝑦2𝑘12absent\displaystyle y_{2}^{k+1/2}=italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT = argmin{ϕξ,k(y1k,y2):y2m},argminconditional-setsubscriptitalic-ϕ𝜉𝑘superscriptsubscript𝑦1𝑘subscript𝑦2subscript𝑦2superscript𝑚\displaystyle\;\mathrm{argmin}\left\{\phi_{\xi,k}(y_{1}^{k},y_{2})\;:\;y_{2}% \in\mathbb{R}^{m}\right\},roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) : italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT } ,
y1k+1=superscriptsubscript𝑦1𝑘1absent\displaystyle y_{1}^{k+1}=italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = argmin{ϕξ,k(y1,y2k+1/2):y1n},argminconditional-setsubscriptitalic-ϕ𝜉𝑘subscript𝑦1superscriptsubscript𝑦2𝑘12subscript𝑦1superscript𝑛\displaystyle\;\mathrm{argmin}\left\{\phi_{\xi,k}(y_{1},y_{2}^{k+1/2})\;:\;y_{% 1}\in\mathbb{R}^{n}\right\},roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT ) : italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT } ,
y2k+1=superscriptsubscript𝑦2𝑘1absent\displaystyle y_{2}^{k+1}=italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = argmin{ϕξ,k(y1k+1,y2):y2m}.argminconditional-setsubscriptitalic-ϕ𝜉𝑘superscriptsubscript𝑦1𝑘1subscript𝑦2subscript𝑦2superscript𝑚\displaystyle\;\mathrm{argmin}\left\{\phi_{\xi,k}(y_{1}^{k+1},y_{2})\;:\;y_{2}% \in\mathbb{R}^{m}\right\}.roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) : italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT } .

Simple calculation shows that the update of y2subscript𝑦2y_{2}italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT involves solving a linear system with coefficient matrix (Im+σDDT)subscript𝐼𝑚𝜎𝐷superscript𝐷𝑇(I_{m}+\sigma DD^{T})( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) and the update of y1subscript𝑦1y_{1}italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT requires computing the proximal mapping of g𝑔gitalic_g which is the projection operator onto the ball 𝔹μsubscript𝔹𝜇\mathbb{B}_{\mu}blackboard_B start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT. Specifically, Algorithm 2 with 𝒮𝒮\mathcal{S}caligraphic_S chosen to be the SGS-operator for 𝒬𝒬\mathcal{Q}caligraphic_Q given in the above applied for solving the problem (DLasso(ξ)DLasso𝜉\textrm{DLasso}(\xi)DLasso ( italic_ξ )) can be summarized as in Algorithm 5. Readers are referred to Appendix C for the detailed description of the algorithm together with an efficient way of updating (Im+σDDT)1superscriptsubscript𝐼𝑚𝜎𝐷superscript𝐷𝑇1\left(I_{m}+\sigma DD^{T}\right)^{-1}( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT.

Fixing the dictionary D𝐷Ditalic_D, the initial point (x0,y10,y20)superscript𝑥0superscriptsubscript𝑦10superscriptsubscript𝑦20(x^{0},y_{1}^{0},y_{2}^{0})( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ), the step size τ𝜏\tauitalic_τ and the maximum number of iterations, we can see that the output xKsuperscript𝑥𝐾x^{K}italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT depends on the received signal ξ𝜉\xiitalic_ξ and the penalty parameters {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }. To emphasize the aforementioned dependency, we denote xK({σj};ξ):=xKassignsuperscript𝑥𝐾subscript𝜎𝑗𝜉superscript𝑥𝐾x^{K}(\{\sigma_{j}\};\xi):=x^{K}italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } ; italic_ξ ) := italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT as the output of the algorithm. Then, the learning objective is to solve the following ERM:

min{σj}1Nj=1NxK({σj},ξ(j))x(ξ(j))2,ξ(j)𝒫,j=1,,N.formulae-sequencesimilar-tosubscriptsubscript𝜎𝑗1𝑁superscriptsubscript𝑗1𝑁superscriptdelimited-∥∥superscript𝑥𝐾subscript𝜎𝑗superscript𝜉𝑗superscript𝑥superscript𝜉𝑗2superscript𝜉𝑗𝒫𝑗1𝑁\min_{\{\sigma_{j}\}}\;\frac{1}{N}\sum_{j=1}^{N}\left\lVert x^{K}(\{\sigma_{j}% \},\xi^{(j)})-x^{*}(\xi^{(j)})\right\rVert^{2},\quad\xi^{(j)}\sim\mathcal{P},% \;j=1,\dots,N.roman_min start_POSTSUBSCRIPT { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∥ italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } , italic_ξ start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) - italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_ξ start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_ξ start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ∼ caligraphic_P , italic_j = 1 , … , italic_N . (3)

4.2 Application to optimal transport problems

Let ξ:=(α;β)m×nassign𝜉𝛼𝛽superscript𝑚superscript𝑛\xi:=(\alpha;\beta)\in\mathbb{R}^{m}\times\mathbb{R}^{n}italic_ξ := ( italic_α ; italic_β ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT be a given tuple of two discrete probability distributions. Let cm×n𝑐superscript𝑚𝑛c\in\mathbb{R}^{m\times n}italic_c ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT be a cost matrix. Then, the discrete optimal transport problem [48] is stated as follows:

minxm×nc,xs.t.xen=α,xTem=β,x0.\min_{x\in\mathbb{R}^{m\times n}}\;\;\left\langle c,x\right\rangle\quad\mathrm% {s.t.}\quad xe_{n}=\alpha,\;x^{T}e_{m}=\beta,\;x\geq 0.roman_min start_POSTSUBSCRIPT italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ⟨ italic_c , italic_x ⟩ roman_s . roman_t . italic_x italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_α , italic_x start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_β , italic_x ≥ 0 . (MOT(ξ)MOT𝜉\textrm{MOT}(\xi)MOT ( italic_ξ ))

Note the (4.2) is an instance of linear programming. By the duality theory for linear programming, the associated dual problem (as an equivalent minimization problem) is given as

miny1m×n,y2m,y3nδ+(y1)α,y2β,y3s.t.y1+y2enT+emy3T=c,\min_{y_{1}\in\mathbb{R}^{m\times n},\;y_{2}\in\mathbb{R}^{m},\;y_{3}\in% \mathbb{R}^{n}}\;\;\delta_{+}(y_{1})-\left\langle\alpha,y_{2}\right\rangle-% \left\langle\beta,y_{3}\right\rangle\quad\mathrm{s.t.}\quad y_{1}+y_{2}e_{n}^{% T}+e_{m}y_{3}^{T}=c,roman_min start_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) - ⟨ italic_α , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⟩ - ⟨ italic_β , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ⟩ roman_s . roman_t . italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = italic_c , (DOT(ξ)DOT𝜉\textrm{DOT}(\xi)DOT ( italic_ξ ))

which leads to the following objective function of the ALM subproblem by choosing appropriate 𝒮𝒮\mathcal{S}caligraphic_S:

ϕξ,k(y):=δ+(y1)+(xkxkenα(xk)Temβ),y+σ2(ImnenImInem)yc2+12yyk𝒮2.assignsubscriptitalic-ϕ𝜉𝑘𝑦subscript𝛿subscript𝑦1matrixsuperscript𝑥𝑘superscript𝑥𝑘subscript𝑒𝑛𝛼superscriptsuperscript𝑥𝑘𝑇subscript𝑒𝑚𝛽𝑦𝜎2superscriptdelimited-∥∥tensor-producttensor-productsubscript𝐼𝑚𝑛subscript𝑒𝑛subscript𝐼𝑚subscript𝐼𝑛subscript𝑒𝑚𝑦𝑐212superscriptsubscriptdelimited-∥∥𝑦superscript𝑦𝑘𝒮2\phi_{\xi,k}(y):=\delta_{+}(y_{1})+\left\langle\begin{pmatrix}x^{k}\\ x^{k}e_{n}-\alpha\\ (x^{k})^{T}e_{m}-\beta\end{pmatrix},y\right\rangle+\frac{\sigma}{2}\left\lVert% (I_{mn}\;e_{n}\otimes I_{m}\;I_{n}\otimes e_{m})y-c\right\rVert^{2}+\frac{1}{2% }\left\lVert y-y^{k}\right\rVert_{\mathcal{S}}^{2}.italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y ) := italic_δ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + ⟨ ( start_ARG start_ROW start_CELL italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - italic_α end_CELL end_ROW start_ROW start_CELL ( italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT - italic_β end_CELL end_ROW end_ARG ) , italic_y ⟩ + divide start_ARG italic_σ end_ARG start_ARG 2 end_ARG ∥ ( italic_I start_POSTSUBSCRIPT italic_m italic_n end_POSTSUBSCRIPT italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ⊗ italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ⊗ italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) italic_y - italic_c ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_y - italic_y start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .

Consequently, we see from Theorem 2 that yk+1=argmin{ϕξ,k(y):ym×n×n×m}superscript𝑦𝑘1argminconditional-setsubscriptitalic-ϕ𝜉𝑘𝑦𝑦superscript𝑚𝑛superscript𝑛superscript𝑚y^{k+1}=\mathrm{argmin}\left\{\phi_{\xi,k}(y)\;:\;y\in\mathbb{R}^{m\times n}% \times\mathbb{R}^{n}\times\mathbb{R}^{m}\right\}italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y ) : italic_y ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT } can be computed as (see also Algorithm 6 in Appendix D for the detailed description)

y3k+1/2=superscriptsubscript𝑦3𝑘12absent\displaystyle y_{3}^{k+1/2}=italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT = argmin{ϕξ,k(y1k,y2k,y3):y3n},argminconditional-setsubscriptitalic-ϕ𝜉𝑘superscriptsubscript𝑦1𝑘superscriptsubscript𝑦2𝑘subscript𝑦3subscript𝑦3superscript𝑛\displaystyle\;\mathrm{argmin}\left\{\phi_{\xi,k}(y_{1}^{k},y_{2}^{k},y_{3})\;% :\;y_{3}\in\mathbb{R}^{n}\right\},roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) : italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT } ,
y2k+1/2=superscriptsubscript𝑦2𝑘12absent\displaystyle y_{2}^{k+1/2}=italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT = argmin{ϕξ,k(y1k,y2,y3k+1/2):y2m},argminconditional-setsubscriptitalic-ϕ𝜉𝑘superscriptsubscript𝑦1𝑘subscript𝑦2superscriptsubscript𝑦3𝑘12subscript𝑦2superscript𝑚\displaystyle\;\mathrm{argmin}\left\{\phi_{\xi,k}(y_{1}^{k},y_{2},y_{3}^{k+1/2% })\;:\;y_{2}\in\mathbb{R}^{m}\right\},roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT ) : italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT } ,
y1k+1=superscriptsubscript𝑦1𝑘1absent\displaystyle y_{1}^{k+1}=italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = argmin{ϕξ,k(y1,y2k+1/2,y3k+1/2):y1m×n},argminconditional-setsubscriptitalic-ϕ𝜉𝑘subscript𝑦1superscriptsubscript𝑦2𝑘12superscriptsubscript𝑦3𝑘12subscript𝑦1superscript𝑚𝑛\displaystyle\;\mathrm{argmin}\left\{\phi_{\xi,k}(y_{1},y_{2}^{k+1/2},y_{3}^{k% +1/2})\;:\;y_{1}\in\mathbb{R}^{m\times n}\right\},roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT ) : italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT } ,
y2k+1=superscriptsubscript𝑦2𝑘1absent\displaystyle y_{2}^{k+1}=italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = argmin{ϕξ,k(y1k+1,y2,y3k+1/2):y2m},argminconditional-setsubscriptitalic-ϕ𝜉𝑘superscriptsubscript𝑦1𝑘1subscript𝑦2superscriptsubscript𝑦3𝑘12subscript𝑦2superscript𝑚\displaystyle\;\mathrm{argmin}\left\{\phi_{\xi,k}(y_{1}^{k+1},y_{2},y_{3}^{k+1% /2})\;:\;y_{2}\in\mathbb{R}^{m}\right\},roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT ) : italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT } ,
y3k+1=superscriptsubscript𝑦3𝑘1absent\displaystyle y_{3}^{k+1}=italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = argmin{ϕξ,k(y1k+1,y2k,y3):y3n}.argminconditional-setsubscriptitalic-ϕ𝜉𝑘superscriptsubscript𝑦1𝑘1superscriptsubscript𝑦2𝑘subscript𝑦3subscript𝑦3superscript𝑛\displaystyle\;\mathrm{argmin}\left\{\phi_{\xi,k}(y_{1}^{k+1},y_{2}^{k},y_{3})% \;:\;y_{3}\in\mathbb{R}^{n}\right\}.roman_argmin { italic_ϕ start_POSTSUBSCRIPT italic_ξ , italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) : italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT } .

Fixing the cost matrix cm×n𝑐superscript𝑚𝑛c\in\mathbb{R}^{m\times n}italic_c ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT, the initial point (x0,y10,y20,y30)superscript𝑥0superscriptsubscript𝑦10superscriptsubscript𝑦20superscriptsubscript𝑦30(x^{0},y_{1}^{0},y_{2}^{0},y_{3}^{0})( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ), the step-size τ𝜏\tauitalic_τ, and the maximum number of iterations, we see that the output xKsuperscript𝑥𝐾x^{K}italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT depends on the marginal distributions ξ:=(α,β)assign𝜉𝛼𝛽\xi:=(\alpha,\beta)italic_ξ := ( italic_α , italic_β ) and the penalty parameters {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }. As usual, we denote xK({σj},ξ):=xKassignsuperscript𝑥𝐾subscript𝜎𝑗𝜉superscript𝑥𝐾x^{K}(\{\sigma_{j}\},\xi):=x^{K}italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } , italic_ξ ) := italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT as the output of the Algorithm 6. Then, the learning objective is to solve the following ERM:

min{σj}1Nj=1NxK({σj},ξ(j))x(ξ(j))2,ξ(j):=(α(j);β(j))𝒫,j=1,,N.formulae-sequenceassignsubscriptsubscript𝜎𝑗1𝑁superscriptsubscript𝑗1𝑁superscriptdelimited-∥∥superscript𝑥𝐾subscript𝜎𝑗superscript𝜉𝑗superscript𝑥superscript𝜉𝑗2superscript𝜉𝑗superscript𝛼𝑗superscript𝛽𝑗similar-to𝒫𝑗1𝑁\min_{\{\sigma_{j}\}}\;\frac{1}{N}\sum_{j=1}^{N}\left\lVert x^{K}(\{\sigma_{j}% \},\xi^{(j)})-x^{*}(\xi^{(j)})\right\rVert^{2},\quad\xi^{(j)}:=(\alpha^{(j)};% \beta^{(j)})\sim\mathcal{P},\;j=1,\dots,N.roman_min start_POSTSUBSCRIPT { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∥ italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( { italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } , italic_ξ start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) - italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_ξ start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_ξ start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT := ( italic_α start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ; italic_β start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) ∼ caligraphic_P , italic_j = 1 , … , italic_N . (4)

Obviously, the backpropagation with respect to {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } can be done in a straightforward manner.

5 Results

In this section, we validate the practical efficiency of the LMPALM by applying it to solve classical Lasso problems and optimal transport problems.

To facilitate comparison, we define log-normalized mean squared error (NMSE) 1Mi=1M10log10(xixi2xi2)1𝑀superscriptsubscript𝑖1𝑀10subscript10subscriptnormsubscript𝑥𝑖subscriptsuperscript𝑥𝑖2subscriptnormsubscriptsuperscript𝑥𝑖2\frac{1}{M}\sum_{i=1}^{M}10\log_{10}\left(\frac{\|x_{i}-x^{*}_{i}\|_{2}}{\|x^{% *}_{i}\|_{2}}\right)divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT 10 roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( divide start_ARG ∥ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG ), where {xi}i=1Msuperscriptsubscriptsubscriptsuperscript𝑥𝑖𝑖1𝑀\{x^{*}_{i}\}_{i=1}^{M}{ italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT and {xi}i=1Msuperscriptsubscriptsubscript𝑥𝑖𝑖1𝑀\{x_{i}\}_{i=1}^{M}{ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT are the optimal solutions and the predicted solutions output by a certain algorithm associated with the testing data set, respectively. We refer readers to Appendix E for detailed experimental settings.

For Lasso problems, We compare the numerical performance of the LMPALM with the MPALM algorithms using pre-specified penalty parameters and with the LISTA algorithm [24]. The computational results with different choices of (m,n)𝑚𝑛(m,n)( italic_m , italic_n ) that plot the NMSE with respect to iteration numbers are presented in Figure 1. As anticipated, LMPALM outperforms fixed-parameter MPALM. Notably, we also observe that all MPALM-based algorithms admit linear convergence. In particular, LMPALM consistently shows faster convergence rate than the fixed-parameter alternatives. We expect this trend to continue with additional iterations, enhancing the comparative advantage of LMPALM. In comparison to LISTA, LMPALM also performs better. Indeed, LISTA exhibits very slow convergence rate in its early iterations, followed by rapid speed-up in later iterations. Conversely, LMPALM has a stable convergence rate, offering a clear advantage over LISTA’s less uniform convergence pattern. Finally, we observe that when the problem size becomes larger, the problem becomes more difficult to solve and the accuracy of the returned solutions downgrades. In this case, K=64𝐾64K=64italic_K = 64 is not sufficient and more iterations are needed to get solutions with high accuracy.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 1: Lasso: normalized MSE for problem sizes (m,n)=(10,20)𝑚𝑛1020(m,n)=(10,20)( italic_m , italic_n ) = ( 10 , 20 ), (10,100)10100(10,100)( 10 , 100 ), (10,200)10200(10,200)( 10 , 200 ), (20,100)20100(20,100)( 20 , 100 )

For optimal transport problems, we compare the performance of LMPALM with the fixed-parameter MPALM algorithms and the commonly used Sinkhorn’s algorithm for calculating approximate solutions [18]. Figure 2 displays the log-normalized mean-squared errors. From the results, we observe that the LMPALM algorithm empirically outperforms all fixed-parameter MPALM alternative and achieves a faster linear rate of convergence. When it comes to the Sinkhorn’s algorithm, we see that the accuracy of the solution is indeed highly sensitive to the entropy regularization parameter λ𝜆\lambdaitalic_λ. Note that with a sufficiently small entropy parameter λ𝜆\lambdaitalic_λ, the Sinkhorn’s method can approximate the solution to the optimal transport problem sufficiently well. However, small λ𝜆\lambdaitalic_λ results in slow convergence and can lead to some numerical issues. This limits the use of Sinkhorn’s to find a highly accurate solution. Indeed, none of the four values of λ𝜆\lambdaitalic_λ offers a highly accurate solution to the optimal transport problem. On the contrary, the LMPALM approach shows excellent robustness. Lastly, to get higher accuracy, a larger number of iterations is needed.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: Optimal transport: NMSE for randomly generated data, with m=n=196𝑚𝑛196m=n=196italic_m = italic_n = 196 and MNIST image data set, with m=n=49𝑚𝑛49m=n=49italic_m = italic_n = 49. The first and third figures: LMPALM; The second and fourth figures: Sinkhorn’s algorithm.

6 Conclusion

In this paper, we successfully applied a Learning to Optimize (L2O) approach to hyperparameter learning for the Majorized Proximal Augmented Lagrangian Method (MPALM), a convergent multi-block ADMM-type method. Our approach has leveraged the convergence properties of MPALM while mitigating its detrimental sensitivity to the penalty hyperparameter. The computational results have demonstrated that MPALM, with adaptively trained hyperparameters, achieves faster convergence compared to existing alternatives in both Lasso and optimal transport problems. Notably, our algorithm’s flexibility allows it to handle optimization problems with arbitrarily many block structures, motivating potential applications to more complex problems such as multi-marginal optimal transport (see Appendix D). However, it is also critical to acknowledge some current limitations of our work. All examples in this study involved Augmented Lagrangian Method (ALM) subproblems that could be solved exactly via elementary linear algebra routines, a requirement for implementing “autograd” for backpropagation. Future research could explore advanced techniques applicable to scenarios where subproblems are solved inexactly, broadening the applicability of our approach to a wider range of practical problems.

Acknowledgments

The authors were partially supported by the US National Science Foundation under awards DMS-2244988, DMS-2206333, the Office of Naval Research Award N00014-23-1-2007, and the DARPA D24AP00325-00.

References

  • [1] Pierre Ablin, Thomas Moreau, Mathurin Massias, and Alexandre Gramfort. Learning step sizes for unfolded sparse coding. Advances in Neural Information Processing Systems, 32, 2019.
  • [2] Jonas Adler and Ozan Öktem. Learned primal-dual reconstruction. IEEE transactions on medical imaging, 37(6):1322–1332, 2018.
  • [3] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
  • [4] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018.
  • [5] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
  • [6] Alon Brifman, Yaniv Romano, and Michael Elad. Turning a denoiser into a super-resolver using plug and play priors. In 2016 IEEE International Conference on Image Processing (ICIP), pages 1404–1408. IEEE, 2016.
  • [7] Emmanuel J Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
  • [8] Augustin Cauchy et al. Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.
  • [9] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40:120–145, 2011.
  • [10] Stanley H Chan, Xiran Wang, and Omar A Elgendy. Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1):84–98, 2016.
  • [11] Caihua Chen, Bingsheng He, Yinyu Ye, and Xiaoming Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1):57–79, 2016.
  • [12] Liang Chen, Xudong Li, Defeng Sun, and Kim-Chuan Toh. On the equivalence of inexact proximal ALM and ADMM for a class of convex composite programming. Mathematical Programming, 185(1-2):111–161, 2021.
  • [13] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
  • [14] Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59, 2022.
  • [15] Jing Cheng, Haifeng Wang, Leslie Ying, and Dong Liang. Model learning: Primal dual networks for fast MR imaging. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22, pages 21–29. Springer, 2019.
  • [16] Hong TM Chu, Ling Liang, Kim-Chuan Toh, and Lei Yang. An efficient implementable inexact entropic proximal point algorithm for a class of linear programming problems. Computational Optimization and Applications, 85(1):107–146, 2023.
  • [17] Benjamin Cowen, Apoorva Nandini Saridena, and Anna Choromanska. Lsalsa: accelerated source separation via learned sparse coding. Machine Learning, 108:1307–1327, 2019.
  • [18] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport differences. Advances in Neural Information Processing Systems, 26:2292–2300, 2013z.
  • [19] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
  • [20] Matthias Feurer and Frank Hutter. Hyperparameter optimization. Automated machine learning: Methods, systems, challenges, pages 3–33, 2019.
  • [21] Anthony V Fiacco and Garth P McCormick. Nonlinear programming: sequential unconstrained minimization techniques. SIAM, 1990.
  • [22] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & mathematics with applications, 2(1):17–40, 1976.
  • [23] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. Advances in neural information processing systems, 29, 2016.
  • [24] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th international conference on international conference on machine learning, pages 399–406, 2010.
  • [25] Satoshi Hara, Weichih Chen, Takashi Washio, Tetsuichi Wazawa, and Takeharu Nagai. Spod-net: Fast recovery of microscopic images using learned ISTA. In Asian Conference on Machine Learning, pages 694–709. PMLR, 2019.
  • [26] Magnus R Hestenes. Multiplier and gradient methods. Journal of optimization theory and applications, 4(5):303–320, 1969.
  • [27] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14(8):2, 2012.
  • [28] Di Hou, Ling Liang, and Kim-Chuan Toh. A sparse smoothing Newton method for solving discrete optimal transport problems. arXiv preprint arXiv:2311.06448, 2023.
  • [29] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization: 5th International Conference, LION 5, Rome, Italy, January 17-21, 2011. Selected Papers 5, pages 507–523. Springer, 2011.
  • [30] Gareth M James, Courtney Paulson, and Paat Rusmevichientong. Penalized and constrained optimization: an application to high-dimensional website advertising. Journal of the American Statistical Association, 2019.
  • [31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [32] Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Neural optimal transport. arXiv preprint arXiv:2201.12220, 2022.
  • [33] Xin Yee Lam, JS Marron, Defeng Sun, and Kim-Chuan Toh. Fast algorithms for large-scale generalized distance weighted discrimination. Journal of Computational and Graphical Statistics, 27(2):368–379, 2018.
  • [34] Xudong Li, Defeng Sun, and Kim-Chuan Toh. A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems. SIAM Journal on Optimization, 28(1):433–458, 2018.
  • [35] Xudong Li, Defeng Sun, and Kim-Chuan Toh. QSDPNAL: A two-phase augmented Lagrangian method for convex quadratic semidefinite programming. Mathematical Programming Computation, 10:703–743, 2018.
  • [36] Xudong Li, Defeng Sun, and Kim-Chuan Toh. A block symmetric Gauss–Seidel decomposition theorem for convex composite quadratic programming and its applications. Mathematical Programming, 175:395–418, 2019.
  • [37] Ling Liang, Xudong Li, Defeng Sun, and Kim-Chuan Toh. QPPAL: A two-phase proximal augmented Lagrangian method for high-dimensional convex quadratic programming problems. ACM Transactions on Mathematical Software (TOMS), 48(3):1–27, 2022.
  • [38] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating missing values in visual data. IEEE transactions on pattern analysis and machine intelligence, 35(1):208–220, 2012.
  • [39] Jialin Liu and Xiaohan Chen. ALISTA: Analytic weights are as good as learned weights in lista. In International Conference on Learning Representations (ICLR), 2019.
  • [40] Yiyang Liu, Zaiwen Wen, and Wotao Yin. A multiscale semi-smooth newton method for optimal transport. Journal of Scientific Computing, 91(2):39, 2022.
  • [41] Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning, pages 6672–6681. PMLR, 2020.
  • [42] Ronak Mehta, Jeffery Kline, Vishnu Suresh Lokhande, Glenn Fung, and Vikas Singh. Efficient discrete multi marginal optimal transport regularization. In The Eleventh International Conference on Learning Representations, 2023.
  • [43] Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 38(2):18–44, 2021.
  • [44] Thomas Moreau and Joan Bruna. Understanding the learned iterative soft thresholding algorithm with matrix factorization. arXiv preprint arXiv:1706.01338, 2017.
  • [45] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
  • [46] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and trends® in Optimization, 1(3):127–239, 2014.
  • [47] Dimitris Perdios, Adrien Besson, Philippe Rossinelli, and Jean-Philippe Thiran. Learning the weight matrix for sparsity averaging in compressive imaging. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3056–3060. IEEE, 2017.
  • [48] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [49] JH Rick Chang, Chun-Liang Li, Barnabas Poczos, BVK Vijaya Kumar, and Aswin C Sankaranarayanan. One network to solve them all–solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision, pages 5888–5897, 2017.
  • [50] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • [51] R Tyrrell Rockafellar. Conjugate duality and optimization. SIAM, 1974.
  • [52] R Tyrrell Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of operations research, 1(2):97–116, 1976.
  • [53] Ernest Ryu, Jialin Liu, Sicheng Wang, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Plug-and-play methods provably converge with properly trained denoisers. In International Conference on Machine Learning, pages 5546–5557. PMLR, 2019.
  • [54] Jian Sun, Huibin Li, Zongben Xu, et al. Deep ADMM-Net for compressive sensing mri. Advances in neural information processing systems, 29, 2016.
  • [55] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
  • [56] Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In 2013 IEEE global conference on signal and information processing, pages 945–948. IEEE, 2013.
  • [57] Xingyu Xie, Jianlong Wu, Guangcan Liu, Zhisheng Zhong, and Zhouchen Lin. Differentiable linearized ADMM. In International Conference on Machine Learning, pages 6902–6911. PMLR, 2019.
  • [58] Lei Yang, Ling Liang, Hong TM Chu, and Kim-Chuan Toh. A corrected inexact proximal augmented Lagrangian method with a relative error criterion for a class of group-quadratic regularized optimal transport problems. Journal of Scientific Computing, 99(3):79, 2024.
  • [59] Filippo Zanetti and Jacek Gondzio. An interior point–inspired algorithm for linear programs arising in discrete optimal transport. INFORMS Journal on Computing, 35(5):1061–1078, 2023.
  • [60] Sihan Zeng, Alyssa Kody, Youngdae Kim, Kibaek Kim, and Daniel K Molzahn. A reinforcement learning approach to parameter selection for distributed optimal power flow. Electric Power Systems Research, 212:108546, 2022.
  • [61] Chen Zhang, Yu Xie, Hang Bai, Bin Yu, Weihong Li, and Yuan Gao. A survey on federated learning. Knowledge-Based Systems, 216:106775, 2021.
  • [62] Joey Tianyi Zhou, Kai Di, Jiawei Du, Xi Peng, Hao Yang, Sinno Jialin Pan, Ivor W. Tsang, Yong Liu, Zheng Qin, and Rick Siow Mong Goh. Sc2net: sparse lstms for sparse coding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018.

Appendix A The two-block ADMM for problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ ))

By introducing an auxiliary variable z𝑧zitalic_z, we can reformulate the problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) as:

miny𝕐,z𝕐1fξ(y)+g(z),s.t.𝒜y=c,y1z=0.\min_{y\in\mathbb{Y},z\in\mathbb{Y}_{1}}\;f_{\xi}(y)+g(z),\quad\mathrm{s.t.}% \quad\mathcal{A}^{*}y=c,\;y_{1}-z=0.roman_min start_POSTSUBSCRIPT italic_y ∈ blackboard_Y , italic_z ∈ blackboard_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ) + italic_g ( italic_z ) , roman_s . roman_t . caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y = italic_c , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_z = 0 .

Given ξ𝜉\xiitalic_ξ and a penalty parameter σ>0𝜎0\sigma>0italic_σ > 0, the augmented Lagrangian function associated with the above problem can be written as

ξ,σ(y,z;x,w):=fξ(y)+g(z)+x,𝒜yc+w,y1z+σ2𝒜yc2+σ2y1z2.assignsubscript𝜉𝜎𝑦𝑧𝑥𝑤subscript𝑓𝜉𝑦𝑔𝑧𝑥superscript𝒜𝑦𝑐𝑤subscript𝑦1𝑧𝜎2superscriptdelimited-∥∥superscript𝒜𝑦𝑐2𝜎2superscriptdelimited-∥∥subscript𝑦1𝑧2\mathcal{L}_{\xi,\sigma}(y,z;x,w):=f_{\xi}(y)+g(z)+\left\langle x,\mathcal{A}^% {*}y-c\right\rangle+\left\langle w,y_{1}-z\right\rangle+\frac{\sigma}{2}\left% \lVert\mathcal{A}^{*}y-c\right\rVert^{2}+\frac{\sigma}{2}\left\lVert y_{1}-z% \right\rVert^{2}.caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y , italic_z ; italic_x , italic_w ) := italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ) + italic_g ( italic_z ) + ⟨ italic_x , caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c ⟩ + ⟨ italic_w , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_z ⟩ + divide start_ARG italic_σ end_ARG start_ARG 2 end_ARG ∥ caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG italic_σ end_ARG start_ARG 2 end_ARG ∥ italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_z ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .

Then, the two-block ADMM method can be described in Algorithm 3.

Input: A fixed point ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ, an initial point (x0,y0,z0,w0)superscript𝑥0superscript𝑦0superscript𝑧0superscript𝑤0(x^{0},y^{0},z^{0},w^{0})( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_z start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_w start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ), the penalty parameter σ>0𝜎0\sigma>0italic_σ > 0, the step size τ(0,1+5/2)𝜏0152\tau\in(0,\sqrt{1+5}/2)italic_τ ∈ ( 0 , square-root start_ARG 1 + 5 end_ARG / 2 ), and the maximum number of iterations K>0𝐾0K>0italic_K > 0.
1
2for k=0,,K1𝑘0𝐾1k=0,\dots,K-1italic_k = 0 , … , italic_K - 1 do
3       yk+1=argminξ,σ(y,zk;xk,wk)superscript𝑦𝑘1argminsubscript𝜉𝜎𝑦superscript𝑧𝑘superscript𝑥𝑘superscript𝑤𝑘y^{k+1}=\mathrm{argmin}\mathcal{L}_{\xi,\sigma}(y,z^{k};x^{k},w^{k})italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_argmin caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y , italic_z start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_w start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ).
4      
5      zk+1=argminξ,σ(yk+1,z;xk,wk)superscript𝑧𝑘1argminsubscript𝜉𝜎superscript𝑦𝑘1𝑧superscript𝑥𝑘superscript𝑤𝑘z^{k+1}=\mathrm{argmin}\mathcal{L}_{\xi,\sigma}(y^{k+1},z;x^{k},w^{k})italic_z start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_argmin caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT , italic_z ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_w start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ).
6      
7      xk+1=xk+τσ(𝒜yk+1c)superscript𝑥𝑘1superscript𝑥𝑘𝜏𝜎superscript𝒜superscript𝑦𝑘1𝑐x^{k+1}=x^{k}+\tau\sigma\left(\mathcal{A}^{*}y^{k+1}-c\right)italic_x start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_τ italic_σ ( caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT - italic_c ).
8      
9      wk+1=wk+τσ(y1k+1zk+1)superscript𝑤𝑘1superscript𝑤𝑘𝜏𝜎superscriptsubscript𝑦1𝑘1superscript𝑧𝑘1w^{k+1}=w^{k}+\tau\sigma\left(y_{1}^{k+1}-z^{k+1}\right)italic_w start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = italic_w start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_τ italic_σ ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT - italic_z start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ).
10      
11 end for
Output: (xK,yK,zK,wK)superscript𝑥𝐾superscript𝑦𝐾superscript𝑧𝐾superscript𝑤𝐾(x^{K},y^{K},z^{K},w^{K})( italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT , italic_z start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT , italic_w start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ).
Algorithm 3 The classic two-block ADMM.

We can see that for updating y𝑦yitalic_y, we need to solve an optimization over the whole space 𝕐𝕐\mathbb{Y}blackboard_Y, which ignore the exploration of the separable structure of the problem.

Appendix B Direct multi-block extension of the two-block ADMM

Recall that the augmented Lagrangian function associated with the problem (P(ξ)𝑃𝜉P(\xi)italic_P ( italic_ξ )) is defined as

σ(y;x):=fξ(y)+g(y1)+𝒜yc,x+σ2𝒜yc2,(y,x)𝕐×𝕏,formulae-sequenceassignsubscript𝜎𝑦𝑥subscript𝑓𝜉𝑦𝑔subscript𝑦1superscript𝒜𝑦𝑐𝑥𝜎2superscriptdelimited-∥∥superscript𝒜𝑦𝑐2for-all𝑦𝑥𝕐𝕏\mathcal{L}_{\sigma}(y;x):=f_{\xi}(y)+g(y_{1})+\left\langle\mathcal{A}^{*}y-c,% x\right\rangle+\frac{\sigma}{2}\left\lVert\mathcal{A}^{*}y-c\right\rVert^{2},% \quad\forall(y,x)\in\mathbb{Y}\times\mathbb{X},caligraphic_L start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT ( italic_y ; italic_x ) := italic_f start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT ( italic_y ) + italic_g ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + ⟨ caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c , italic_x ⟩ + divide start_ARG italic_σ end_ARG start_ARG 2 end_ARG ∥ caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y - italic_c ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , ∀ ( italic_y , italic_x ) ∈ blackboard_Y × blackboard_X ,

where x𝕏𝑥𝕏x\in\mathbb{X}italic_x ∈ blackboard_X denotes the Lagrange multiplier with respect to the linear constraint 𝒜y=csuperscript𝒜𝑦𝑐\mathcal{A}^{*}y=ccaligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y = italic_c. Then the multi-block ADMM has the following template, as shown in Algorithm 4.

Input: A fixed point ξΞ𝜉Ξ\xi\in\Xiitalic_ξ ∈ roman_Ξ, an initial point (x0,y0)superscript𝑥0superscript𝑦0(x^{0},y^{0})( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ), the penalty parameter σ>0𝜎0\sigma>0italic_σ > 0, the step size τ(0,1+5/2)𝜏0152\tau\in(0,\sqrt{1+5}/2)italic_τ ∈ ( 0 , square-root start_ARG 1 + 5 end_ARG / 2 ), and the maximum number of iterations K>0𝐾0K>0italic_K > 0.
1
2for k=0,,K1𝑘0𝐾1k=0,\dots,K-1italic_k = 0 , … , italic_K - 1 do
3      
4      for i=1,,p𝑖1𝑝i=1,\dots,pitalic_i = 1 , … , italic_p do
5             yik+1=argminξ,σ(y1k+1,,yi1k+1,yi,yi+1k,,ypk;xk)superscriptsubscript𝑦𝑖𝑘1argminsubscript𝜉𝜎superscriptsubscript𝑦1𝑘1superscriptsubscript𝑦𝑖1𝑘1subscript𝑦𝑖superscriptsubscript𝑦𝑖1𝑘superscriptsubscript𝑦𝑝𝑘superscript𝑥𝑘y_{i}^{k+1}=\mathrm{argmin}\mathcal{L}_{\xi,\sigma}(y_{1}^{k+1},\dots,y_{i-1}^% {k+1},y_{i},y_{i+1}^{k},\dots,y_{p}^{k};x^{k})italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_argmin caligraphic_L start_POSTSUBSCRIPT italic_ξ , italic_σ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ; italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ).
6            
7       end for
8      
9      xk+1=xk+τσ(𝒜yk+1c)superscript𝑥𝑘1superscript𝑥𝑘𝜏𝜎superscript𝒜superscript𝑦𝑘1𝑐x^{k+1}=x^{k}+\tau\sigma\left(\mathcal{A}^{*}y^{k+1}-c\right)italic_x start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_τ italic_σ ( caligraphic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_y start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT - italic_c ).
10      
11 end for
Output: (xK,yK)superscript𝑥𝐾superscript𝑦𝐾(x^{K},y^{K})( italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ).
Algorithm 4 The mutli-block ADMM.

However, as mentioned in the introduction, the above direct extension is not convergent unless under more stringent conditions [11].

Appendix C MPALM for Lasso

The MPALM applied for solving the dual form of the Lasso problem is presented in Algorithm 5.

Input: The dictionary Dm×n𝐷superscript𝑚𝑛D\in\mathbb{R}^{m\times n}italic_D ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT, the received signal ξm𝜉superscript𝑚\xi\in\mathbb{R}^{m}italic_ξ ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, the regularization parameter μ>0𝜇0\mu>0italic_μ > 0, an initial point (x0,y10,y20)n×n×msuperscript𝑥0superscriptsubscript𝑦10superscriptsubscript𝑦20superscript𝑛superscript𝑛superscript𝑚(x^{0},y_{1}^{0},y_{2}^{0})\in\mathbb{R}^{n}\times\mathbb{R}^{n}\times\mathbb{% R}^{m}( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, the penalty parameter {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }, the step size τ(0,2)𝜏02\tau\in(0,2)italic_τ ∈ ( 0 , 2 ), the maximum number of iteration K>0𝐾0K>0italic_K > 0 and a positive integer K0Ksubscript𝐾0𝐾K_{0}\leq Kitalic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≤ italic_K.
1 for k=0,,K1𝑘0𝐾1k=0,\dots,K-1italic_k = 0 , … , italic_K - 1 do
2       Find j𝑗jitalic_j such that k[jK0,(j+1)K0)𝑘𝑗subscript𝐾0𝑗1subscript𝐾0k\in[jK_{0},(j+1)K_{0})italic_k ∈ [ italic_j italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , ( italic_j + 1 ) italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and set σ=σj𝜎subscript𝜎𝑗\sigma=\sigma_{j}italic_σ = italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.
3      
4      y2k+1/2=(Im+σDDT)1(ξDxkσDy1k)superscriptsubscript𝑦2𝑘12superscriptsubscript𝐼𝑚𝜎𝐷superscript𝐷𝑇1𝜉𝐷superscript𝑥𝑘𝜎𝐷superscriptsubscript𝑦1𝑘y_{2}^{k+1/2}=\left(I_{m}+\sigma DD^{T}\right)^{-1}\left(\xi-Dx^{k}-\sigma Dy_% {1}^{k}\right)italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT = ( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_ξ - italic_D italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT - italic_σ italic_D italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ).
5      
6      y1k+1=proj𝔹μ(DTy2k+1/2+1σxk)superscriptsubscript𝑦1𝑘1subscriptprojsubscript𝔹𝜇superscript𝐷𝑇superscriptsubscript𝑦2𝑘121𝜎superscript𝑥𝑘y_{1}^{k+1}=-\mathrm{proj}_{\mathbb{B}_{\mu}}\left(D^{T}y_{2}^{k+1/2}+\frac{1}% {\sigma}x^{k}\right)italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = - roman_proj start_POSTSUBSCRIPT blackboard_B start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ).
7      
8      y2k+1=(Im+σDDT)1(ξDxkσDy1k+1)superscriptsubscript𝑦2𝑘1superscriptsubscript𝐼𝑚𝜎𝐷superscript𝐷𝑇1𝜉𝐷superscript𝑥𝑘𝜎𝐷superscriptsubscript𝑦1𝑘1y_{2}^{k+1}=\left(I_{m}+\sigma DD^{T}\right)^{-1}\left(\xi-Dx^{k}-\sigma Dy_{1% }^{k+1}\right)italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = ( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_ξ - italic_D italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT - italic_σ italic_D italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ).
9      
10      xk+1=xk+τσ(y1k+1+DTy2k+1)superscript𝑥𝑘1superscript𝑥𝑘𝜏𝜎superscriptsubscript𝑦1𝑘1superscript𝐷𝑇superscriptsubscript𝑦2𝑘1x^{k+1}=x^{k}+\tau\sigma\left(y_{1}^{k+1}+D^{T}y_{2}^{k+1}\right)italic_x start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_τ italic_σ ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT + italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ).
11      
12 end for
Output: xKsuperscript𝑥𝐾x^{K}italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT.
Algorithm 5 The proximal ALM for (DLasso(ξ)DLasso𝜉\textrm{DLasso}(\xi)DLasso ( italic_ξ ))

We next show how to efficiently update the inverse of the matrix (Im+σDDT)subscript𝐼𝑚𝜎𝐷superscript𝐷𝑇\left(I_{m}+\sigma DD^{T}\right)( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) without breaking down the computational tree when performing backpropagation as needed in optimizing (3). Let the spectral decomposition of DDT𝐷superscript𝐷𝑇DD^{T}italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT be given as

DDT=PΛPT,Λ=Diag(λ1,,λm),λ1λm0.formulae-sequence𝐷superscript𝐷𝑇𝑃Λsuperscript𝑃𝑇formulae-sequenceΛDiagsubscript𝜆1subscript𝜆𝑚subscript𝜆1subscript𝜆𝑚0DD^{T}=P\Lambda P^{T},\quad\Lambda=\mathrm{Diag}(\lambda_{1},\dots,\lambda_{m}% ),\quad\lambda_{1}\geq\dots\geq\lambda_{m}\geq 0.italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = italic_P roman_Λ italic_P start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , roman_Λ = roman_Diag ( italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_λ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) , italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≥ ⋯ ≥ italic_λ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ≥ 0 .

Then for any σ>0𝜎0\sigma>0italic_σ > 0, we see that the matrix Im+σDDTsubscript𝐼𝑚𝜎𝐷superscript𝐷𝑇I_{m}+\sigma DD^{T}italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT admits a spectral decomposition

Im+σDDT=PDiag(1+σλ1,,1+σλm)PT.subscript𝐼𝑚𝜎𝐷superscript𝐷𝑇𝑃Diag1𝜎subscript𝜆11𝜎subscript𝜆𝑚superscript𝑃𝑇I_{m}+\sigma DD^{T}=P\mathrm{Diag}\left(1+\sigma\lambda_{1},\dots,1+\sigma% \lambda_{m}\right)P^{T}.italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = italic_P roman_Diag ( 1 + italic_σ italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , 1 + italic_σ italic_λ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) italic_P start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT .

Hence, we get

(Im+σDDT)1=PDiag(11+σλ1,,11+σλm)PT.superscriptsubscript𝐼𝑚𝜎𝐷superscript𝐷𝑇1𝑃Diag11𝜎subscript𝜆111𝜎subscript𝜆𝑚superscript𝑃𝑇\left(I_{m}+\sigma DD^{T}\right)^{-1}=P\mathrm{Diag}\left(\frac{1}{1+\sigma% \lambda_{1}},\dots,\frac{1}{1+\sigma\lambda_{m}}\right)P^{T}.( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT = italic_P roman_Diag ( divide start_ARG 1 end_ARG start_ARG 1 + italic_σ italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG , … , divide start_ARG 1 end_ARG start_ARG 1 + italic_σ italic_λ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG ) italic_P start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT .

Here, the orthogonal matrix P𝑃Pitalic_P and the eigenvalues λ1,,λmsubscript𝜆1subscript𝜆𝑚\lambda_{1},\dots,\lambda_{m}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_λ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT need only to be computed once. So, the inverse (Im+σDDT)1superscriptsubscript𝐼𝑚𝜎𝐷superscript𝐷𝑇1\left(I_{m}+\sigma DD^{T}\right)^{-1}( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + italic_σ italic_D italic_D start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT, as a function of σ𝜎\sigmaitalic_σ, is continuously differentiable in σ𝜎\sigmaitalic_σ.

Appendix D MPALM for optimal transport

The MPALM applied for solving the dual problem of optimal transport problem can be described in Algorithm 6.

Input: Two marginal distributions ξ:=(α;β)m×nassign𝜉𝛼𝛽superscript𝑚superscript𝑛\xi:=(\alpha;\beta)\in\mathbb{R}^{m}\times\mathbb{R}^{n}italic_ξ := ( italic_α ; italic_β ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, the cost matrix cm×n𝑐superscript𝑚𝑛c\in\mathbb{R}^{m\times n}italic_c ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT, an initial point (x0,y10,y20,y30)m×n×m×n×m×nsuperscript𝑥0superscriptsubscript𝑦10superscriptsubscript𝑦20superscriptsubscript𝑦30superscript𝑚𝑛superscript𝑚𝑛superscript𝑚superscript𝑛(x^{0},y_{1}^{0},y_{2}^{0},y_{3}^{0})\in\mathbb{R}^{m\times n}\times\mathbb{R}% ^{m\times n}\times\mathbb{R}^{m}\times\mathbb{R}^{n}( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, the penalty parameter {σj}subscript𝜎𝑗\{\sigma_{j}\}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }, the step size τ(0,2)𝜏02\tau\in(0,2)italic_τ ∈ ( 0 , 2 ), the maximum number of iteration K>0𝐾0K>0italic_K > 0 and a positive integer K0Ksubscript𝐾0𝐾K_{0}\leq Kitalic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≤ italic_K.
1 for k=0,,K1𝑘0𝐾1k=0,\dots,K-1italic_k = 0 , … , italic_K - 1 do
2       Find j𝑗jitalic_j such that k[jK0,(j+1)K0)𝑘𝑗subscript𝐾0𝑗1subscript𝐾0k\in[jK_{0},(j+1)K_{0})italic_k ∈ [ italic_j italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , ( italic_j + 1 ) italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and set σ=σj𝜎subscript𝜎𝑗\sigma=\sigma_{j}italic_σ = italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.
3      
4      y3k+1/2=1m(1σβ(y1k+y2kenTc+1σxk)Tem)superscriptsubscript𝑦3𝑘121𝑚1𝜎𝛽superscriptsuperscriptsubscript𝑦1𝑘superscriptsubscript𝑦2𝑘superscriptsubscript𝑒𝑛𝑇𝑐1𝜎superscript𝑥𝑘𝑇subscript𝑒𝑚y_{3}^{k+1/2}=\frac{1}{m}\left(\frac{1}{\sigma}\beta-\left(y_{1}^{k}+y_{2}^{k}% e_{n}^{T}-c+\frac{1}{\sigma}x^{k}\right)^{T}e_{m}\right)italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_m end_ARG ( divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_β - ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_c + divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ).
5      
6      y2k+1/2=1n(1σα(y1k+em(y3k+1/2)Tc+1σxk)en)superscriptsubscript𝑦2𝑘121𝑛1𝜎𝛼superscriptsubscript𝑦1𝑘subscript𝑒𝑚superscriptsuperscriptsubscript𝑦3𝑘12𝑇𝑐1𝜎superscript𝑥𝑘subscript𝑒𝑛y_{2}^{k+1/2}=\frac{1}{n}\left(\frac{1}{\sigma}\alpha-\left(y_{1}^{k}+e_{m}(y_% {3}^{k+1/2})^{T}-c+\frac{1}{\sigma}x^{k}\right)e_{n}\right)italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ( divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_α - ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_c + divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ).
7      
8      y1k+1=max{0,cy2k+1/2enTem(y3k+1/2)T1σxk}superscriptsubscript𝑦1𝑘10𝑐superscriptsubscript𝑦2𝑘12superscriptsubscript𝑒𝑛𝑇subscript𝑒𝑚superscriptsuperscriptsubscript𝑦3𝑘12𝑇1𝜎superscript𝑥𝑘y_{1}^{k+1}=\max\left\{0,c-y_{2}^{k+1/2}e_{n}^{T}-e_{m}(y_{3}^{k+1/2})^{T}-% \frac{1}{\sigma}x^{k}\right\}italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = roman_max { 0 , italic_c - italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT }.
9      
10      y2k+1=1n(1σα(y1k+1+em(y3k+1/2)Tc+1σxk)en)superscriptsubscript𝑦2𝑘11𝑛1𝜎𝛼superscriptsubscript𝑦1𝑘1subscript𝑒𝑚superscriptsuperscriptsubscript𝑦3𝑘12𝑇𝑐1𝜎superscript𝑥𝑘subscript𝑒𝑛y_{2}^{k+1}=\frac{1}{n}\left(\frac{1}{\sigma}\alpha-\left(y_{1}^{k+1}+e_{m}(y_% {3}^{k+1/2})^{T}-c+\frac{1}{\sigma}x^{k}\right)e_{n}\right)italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ( divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_α - ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT + italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 / 2 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_c + divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ).
11      
12      y3k+1=1m(1σβ(y1k+1+y2k+1enTc+1σxk)Tem)superscriptsubscript𝑦3𝑘11𝑚1𝜎𝛽superscriptsuperscriptsubscript𝑦1𝑘1superscriptsubscript𝑦2𝑘1superscriptsubscript𝑒𝑛𝑇𝑐1𝜎superscript𝑥𝑘𝑇subscript𝑒𝑚y_{3}^{k+1}=\frac{1}{m}\left(\frac{1}{\sigma}\beta-\left(y_{1}^{k+1}+y_{2}^{k+% 1}e_{n}^{T}-c+\frac{1}{\sigma}x^{k}\right)^{T}e_{m}\right)italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_m end_ARG ( divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_β - ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT + italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_c + divide start_ARG 1 end_ARG start_ARG italic_σ end_ARG italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ).
13      
14      xk+1=xk+τσ(y1k+1+y2k+1enT+em(y3k+1)Tc)superscript𝑥𝑘1superscript𝑥𝑘𝜏𝜎superscriptsubscript𝑦1𝑘1superscriptsubscript𝑦2𝑘1superscriptsubscript𝑒𝑛𝑇subscript𝑒𝑚superscriptsuperscriptsubscript𝑦3𝑘1𝑇𝑐x^{k+1}=x^{k}+\tau\sigma\left(y_{1}^{k+1}+y_{2}^{k+1}e_{n}^{T}+e_{m}(y_{3}^{k+% 1})^{T}-c\right)italic_x start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT + italic_τ italic_σ ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT + italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_e start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_c ).
15      
16 end for
Output: xKsuperscript𝑥𝐾x^{K}italic_x start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT.
Algorithm 6 The proximal ALM for (DOT(ξ)DOT𝜉\textrm{DOT}(\xi)DOT ( italic_ξ ))
Remark 1.

The proposed framework can be easily extended to solve the multi-marginal optimal transport problems [42]:

minxn1××nqc,xs.t.𝒜i(x)=α(i),i=1,,q,x0,\min_{x\in\mathbb{R}^{n_{1}\times\dots\times n_{q}}}\;\;\left\langle c,x\right% \rangle\quad\mathrm{s.t.}\quad\mathcal{A}_{i}(x)=\alpha^{(i)},\quad i=1,\dots,% q,\;x\geq 0,roman_min start_POSTSUBSCRIPT italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_n start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ⟨ italic_c , italic_x ⟩ roman_s . roman_t . caligraphic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ) = italic_α start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_i = 1 , … , italic_q , italic_x ≥ 0 ,

where cn1××nq𝑐superscriptsubscript𝑛1subscript𝑛𝑞c\in\mathbb{R}^{n_{1}\times\dots\times n_{q}}italic_c ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_n start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the cost tensor, 𝒜i:n1××nqni:subscript𝒜𝑖superscriptsubscript𝑛1subscript𝑛𝑞superscriptsubscript𝑛𝑖\mathcal{A}_{i}:\mathbb{R}^{n_{1}\times\dots\times n_{q}}\to\mathbb{R}^{n_{i}}caligraphic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_n start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is a linear mapping that compute the i𝑖iitalic_i-th marginal of its input, and α(i)nisuperscript𝛼𝑖superscriptsubscript𝑛𝑖\alpha^{(i)}\in\mathbb{R}^{n_{i}}italic_α start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT are given marginal distribution, for i=1,,q𝑖1𝑞i=1,\dots,qitalic_i = 1 , … , italic_q. The corresponding dual problem (as an equivalent minimization problem) is then given by

minyini, 1iqδ+(y1)i=1qα(i),yi+1s.t.y1+y2yq+1=c,\min_{y_{i}\in\mathbb{R}^{n_{i}},\;1\leq i\leq q}\;\;\delta_{+}(y_{1})-\sum_{i% =1}^{q}\left\langle\alpha^{(i)},y_{i+1}\right\rangle\quad\mathrm{s.t.}\quad y_% {1}+y_{2}\oplus\dots\oplus y_{q+1}=c,roman_min start_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , 1 ≤ italic_i ≤ italic_q end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT ⟨ italic_α start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT ⟩ roman_s . roman_t . italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⊕ ⋯ ⊕ italic_y start_POSTSUBSCRIPT italic_q + 1 end_POSTSUBSCRIPT = italic_c ,

where y2yq+1n1××nqdirect-sumsubscript𝑦2subscript𝑦𝑞1superscriptsubscript𝑛1subscript𝑛𝑞y_{2}\oplus\dots\oplus y_{q+1}\in\mathbb{R}^{n_{1}\times\dots\times n_{q}}italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⊕ ⋯ ⊕ italic_y start_POSTSUBSCRIPT italic_q + 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_n start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_POSTSUPERSCRIPT denotes the tensor whose (i2,,iq+1)subscript𝑖2subscript𝑖𝑞1(i_{2},\dots,i_{q+1})( italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_i start_POSTSUBSCRIPT italic_q + 1 end_POSTSUBSCRIPT ) entry is y2(i2)++yq+1(iq+1)subscript𝑦2subscript𝑖2subscript𝑦𝑞1subscript𝑖𝑞1y_{2}(i_{2})+\dots+y_{q+1}(i_{q+1})italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) + ⋯ + italic_y start_POSTSUBSCRIPT italic_q + 1 end_POSTSUBSCRIPT ( italic_i start_POSTSUBSCRIPT italic_q + 1 end_POSTSUBSCRIPT ), for any indices 1i2n1,,1iq+1nqformulae-sequence1subscript𝑖2subscript𝑛11subscript𝑖𝑞1subscript𝑛𝑞1\leq i_{2}\leq n_{1},\;\dots,1\leq i_{q+1}\leq n_{q}1 ≤ italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ≤ italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , 1 ≤ italic_i start_POSTSUBSCRIPT italic_q + 1 end_POSTSUBSCRIPT ≤ italic_n start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT. We see that the dual problem has q+1𝑞1q+1italic_q + 1 blocks and the proposed methodology is directly applicable.

Appendix E Detailed experimental settings

We shall present the detailed experimental setting for both applications.

E.1 Lasso problems

It is known that L2O has demonstrated promising performance in the Lasso problem. One notable L2O approach, extensively studied in the literature, is the Learned ISTA (LISTA) [24]. LISTA, similar in structure to ISTA but with trained parameters, exhibits substantial improvements over its predecessor. Recent advancements have been made to further improve the convergence properties of the approach, however, the improvement in terms of practical efficiency is relatively modest, based on the results presented in [14] and references therein. We focus on demonstrating that our proposed algorithm significantly outperforms the original LISTA, thereby outperforming similar approaches that share comparable performance with LISTA.

Following a similar procedure in [14], we consider testing randomly generated data. Particularly, in our experiment, the regularization parameter μ𝜇\muitalic_μ is fixed to be 0.10.10.10.1 and Dm×n𝐷superscript𝑚𝑛D\in\mathbb{R}^{m\times n}italic_D ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT is generated by sampling its entries from the standard Gaussian distribution and then normalizing the columns to have unit norms. With given μ𝜇\muitalic_μ and D𝐷Ditalic_D, we then sample 50,0005000050,00050 , 000 vectors {ξ(i)}i=1Nsuperscriptsubscriptsuperscript𝜉𝑖𝑖1𝑁\{\xi^{(i)}\}_{i=1}^{N}{ italic_ξ start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT from the standard Gaussian distribution to generate a set of Lasso problems. The optimal solutions {xi}i=1Nsuperscriptsubscriptsuperscriptsubscript𝑥𝑖𝑖1𝑁\{x_{i}^{*}\}_{i=1}^{N}{ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT for these problems are then computed using the powerful commercial solver GUROBI (version 11.0.1, with an academic license). The training size is set as N=45,000𝑁45000N=45,000italic_N = 45 , 000 and the testing size is set as M=5,000𝑀5000M=5,000italic_M = 5 , 000.

We compare the numerical performance of the LMPALM with the MPALM algorithms using pre-specified penalty parameters σ=10k𝜎superscript10𝑘\sigma=10^{k}italic_σ = 10 start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT, where k=2,1,0,1,2𝑘21012k=-2,-1,0,1,2italic_k = - 2 , - 1 , 0 , 1 , 2, and with the LISTA algorithm [24]. The computational results with different choices of (m,n)𝑚𝑛(m,n)( italic_m , italic_n ) that plot the NMSE with respect to iteration numbers are presented in Figure 1, where the maximum number of iterations is set as K=64𝐾64K=64italic_K = 64. Both LMPALM and LISTA are trained to minimize objective ERM using the AdamW optimizer. Hyperparameters for the training of LISTA are taken from [39]. For LMPALM, we set betas=(0.999,0.999)betas0.9990.999\texttt{betas}=(0.999,0.999)betas = ( 0.999 , 0.999 ) and learning rate as lr=0.001lr0.001\texttt{lr}=0.001lr = 0.001 222See https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html for more detials.. Moreover, in LMPALM, we use eight restarts, i.e., a set of eight parameters {σj}j=18superscriptsubscriptsubscript𝜎𝑗𝑗18\{\sigma_{j}\}_{j=1}^{8}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 8 end_POSTSUPERSCRIPT is learned. The parameters are initialized as 1. Based on our numerical experience, increasing the number of restarts enhances the robustness of the algorithm, albeit with increased computational cost. A batch size of 4,50045004,5004 , 500 is used in our training and the model is trained for 250 epochs.

E.2 Optimal transport problems

Next, we shall consider the optimal transport problem. In our experiments, we set m=n𝑚𝑛m=nitalic_m = italic_n for simplicity.

The cost matrix cm×n𝑐superscript𝑚𝑛c\in\mathbb{R}^{m\times n}italic_c ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT captures the squared distance between corresponding entries of α𝛼\alphaitalic_α and β𝛽\betaitalic_β, i.e., Cij=|ij|2subscript𝐶𝑖𝑗superscript𝑖𝑗2C_{ij}=|i-j|^{2}italic_C start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = | italic_i - italic_j | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT for 1i,jmformulae-sequence1𝑖𝑗𝑚1\leq i,j\leq m1 ≤ italic_i , italic_j ≤ italic_m.

We consider the following two ways for generating the data set {(α(i),β(i))}i=15000superscriptsubscriptsuperscript𝛼𝑖superscript𝛽𝑖𝑖15000\{(\alpha^{(i)},\beta^{(i)})\}_{i=1}^{5000}{ ( italic_α start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_β start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 5000 end_POSTSUPERSCRIPT consisting of 5,000 instances: (1) the marginal distributions α(i)superscript𝛼𝑖\alpha^{(i)}italic_α start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and β(i)superscript𝛽𝑖\beta^{(i)}italic_β start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT are randomly generated whose entries are drawn from the uniform distribution on the interval (0,1)01(0,1)( 0 , 1 ); (2) α(i)superscript𝛼𝑖\alpha^{(i)}italic_α start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and β(i)superscript𝛽𝑖\beta^{(i)}italic_β start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT are generated by flattening 7×7777\times 77 × 7 MNIST images corresponding to the digits “2” and “4”, respectively. Note that we also normalized the distributions α(i)superscript𝛼𝑖\alpha^{(i)}italic_α start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and β(i)superscript𝛽𝑖\beta^{(i)}italic_β start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT by dividing their sums so that they all sum up to one. Again, for each case, the corresponding true optimal solution is computed by the commercial solver GUBORI. The training size and the testing size are set as N=4,500𝑁4500N=4,500italic_N = 4 , 500 and M=500𝑀500M=500italic_M = 500, respectively. We train LMPALM with four restarts for 500 epochs with a batch size of 2,750. As usual, the initial parameters {σj}j=14superscriptsubscriptsubscript𝜎𝑗𝑗14\{\sigma_{j}\}_{j=1}^{4}{ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT are initialized as one. For MPALM-based methods, we set the total number of iterations as K=100𝐾100K=100italic_K = 100.