1 Introduction

Efficient Combinatorial Optimization via Heat Diffusion
Hengyuan Ma¹, Wenlian Lu^1,2,3,4,5,6, Jianfeng Feng^{1,2,3,4,7 $\ast$}

1 Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai 200433, China
2 Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
3 School of Mathematical Sciences, Fudan University, No. 220 Handan Road, Shanghai, 200433, Shanghai, China
4 Shanghai Center for Mathematical Sciences, No. 220 Handan Road, Shanghai, 200433, Shanghai, China
5 Shanghai Key Laboratory for Contemporary Applied Mathematics, No. 220 Handan Road, Shanghai, 200433, Shanghai, China
6 Key Laboratory of Mathematics for Nonlinear Science, No. 220 Handan Road, Shanghai, 200433, Shanghai, China
7 Department of Computer Science, University of Warwick, Coventry, CV4 7AL, UK
$\ast$ [email protected]

Abstract

Combinatorial optimization problems are widespread but inherently challenging due to their discrete nature. The primary limitation of existing methods is that they can only access a small fraction of the solution space at each iteration, resulting in limited efficiency for searching the global optimal. To overcome this challenge, diverging from conventional efforts of expanding the solver’s search scope, we focus on enabling information to actively propagate to the solver through heat diffusion. By transforming the target function while preserving its optima, heat diffusion facilitates information flow from distant regions to the solver, providing more efficient navigation. Utilizing heat diffusion, we propose a framework for solving general combinatorial optimization problems. The proposed methodology demonstrates superior performance across a range of the most challenging and widely encountered combinatorial optimizations. Echoing recent advancements in harnessing thermodynamics for generative artificial intelligence, our study further reveals its significant potential in advancing combinatorial optimization. The codebase of our study is available in https://github.com/AwakerMhy/HeO.

1 Introduction

Combinatorial optimization problems are prevalent in various applications, encompassing circuit design [1], machine learning [2], computer vision [3], molecular dynamics simulation [4], traffic flow optimization [5], and financial risk analysis [6]. This widespread application creates a significant demand for accelerated solutions to these problems. Alongside classical algorithms, which encompass both exact solvers and metaheuristics [7], recent years have seen remarkable advancements in addressing combinatorial optimization. These include quantum adiabatic approaches [8, 9, 10], simulated bifurcation [11, 12, 13], coherent Ising machine [14, 15], high-order Ising machine [16], and deep learning techniques [17, 18]. However, due to the exponential growth of the solution number, finding the optima within a limited computational budget remains a daunting challenge.

Our primary focus is on iterative approximation solvers, which constitute a significant class of combinatorial optimization methods. An iterative approximation solvers typically begin with an initial solution and iteratively improve it by finding better solutions within the neighborhood of the current solution, known as the search scope or more vividly, receptive field. However, due to combinatorial explosion, as the scope of the receptive field increases, the number of solutions to be assessed grows exponentially, making a thorough evaluation of all these solutions expensive. As a result, current approaches are limited to a narrow receptive field, rendering them blind to distant regions in the solution space and heightening the risk of getting trapped in local minimas or areas with bumpy landscapes. Although methods like large neighborhood search [19], variable neighborhood search [20] and path auxiliary sampling [21] are designed to broaden the search scope, they can only gather a modest increment of information from the expanded search scope. Consequently, the current solvers’ receptive field remains significantly constrained, impeding their efficiency.

Refer to caption — Figure 1: The Heat diffusion optimization (HeO) framework. The efficiency of searching a key in a dark room is significantly improved by employing navigation that utilizes heat emission from the key. In combinatorial optimization, heat diffusion transforms the target function into different versions while preserving the location of the optima. Therefore, the gradient information of these transformed functions cooperatively help to optimize the original target function.

In this study, we approach the prevalent limitation stated above from a unique perspective. Instead of expanding the solver’s receptive field to acquire more information from the solution space, we concentrate on propagating information from distant areas of the solution space to the solver via heat diffusion [22]. To illustrate, imagine a solver searching for the optima in the solution space akin to a person searching for a key in a dark room, as depicted in Fig. 1. Without light, the person is compelled to rely solely on touching his surrounding space. The tactile provides only localized information, leading to inefficient navigation. This mirrors the current situation in combinatorial optimization, wherein the receptive field is predominantly confined to local information. However, if the key were to emit heat, its radiating warmth would be perceptible from a distance, acting as a directional beacon. This would significantly enhance navigational efficiency for finding the key.

Motivated by the above metaphor, we propose a simple but efficient framework utilizing heat diffusion to solve various combinatorial optimization problems. Heat diffusion transforms the target function into different versions, within which the information from distant regions actively flow toward the solver. Crucially, the backward uniqueness of the heat equation [23] guarantees that the original problem’s optima are unchanged under these transformations. Therefore, information of target functions under different heat diffusion transformations can be cooperatively employed for optimize the original problem (Fig. 1). Empirically, our framework demonstrates superior performance compared to advanced algorithms across a diverse range of combinatorial optimization instances, spanning quadratic to polynomial, binary to ternary, unconstrained to constrained, and discrete to mixed-variable scenarios. Mirroring the recent breakthroughs in generative artificial intelligence through diffusion processes [24], our research further reveals the potential of heat diffusion, a related thermodynamic phenomenon, in enhancing combinatorial optimization.

2 Failure of gradient-based combinatorial optimization

We will first formulate the general combinatorial optimization problem and then reframe it in a gradient descent-based manner. This reformulation allows us to utilize heat diffusion later. Various combinatorial optimization problems can be naturally formalized as a pseudo-Boolean optimization (PBO) problem [25], in which we aim to find the minima of a real-value target function $f\in\mathbb{R}^{n}\mapsto\mathbb{R}$ subjecting to a binary constraints

\displaystyle\min_{\mathbf{s}\in\{-1,1\}^{n}}f(\mathbf{s}),

(1)

where $\mathbf{s}$ is binary configuration (bits), and $f(\cdot)$ is the target function. Through the transformation $(\mathbf{s}+1)/2$ , our definition aligns with that in [26], where elements of $\mathbf{s}$ take $0$ or $1$ . Given the advanced development of gradient-based algorithms, we are interested in converting the discrete optimization problem into a differentiable one, thereby enabling gradient descent. To achieve this purpose, we encode the bits $s_{i}$ , $i=1,\ldots,n$ as independent Bernoulli variables $p(s_{i}=\pm 1|\bm{\theta})=0.5\pm(\theta_{i}-0.5)$ with $\theta_{i}\in[0,1]$ . In this way, we convert the original combinatorial optimization problem into

\displaystyle\min_{\bm{\theta}\in\mathcal{I}}h(\bm{\theta}),

(2)

where $\mathcal{I}:=[0,1]^{n}$ , and $h(\bm{\theta})=\mathbb{E}_{p(\mathbf{s}|\bm{\theta})}[f(\mathbf{s})]$ . The minima $\bm{\theta}^{\ast}$ of Eq. (2) is $\bm{\theta}^{\ast}=0.5(\mathrm{sgn}(\mathbf{s}^{\ast})+1)$ , given that $\mathbf{s}^{\ast}$ is a minima of the original problem Eq. (1). Here, $\mathrm{sgn}(\cdot)$ is the element-wise sign function. Now Eq. (2) can be solved through gradient descent starting from a given initial $\bm{\theta}_{0}$

\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\gamma\bigtriangledown_{\bm{% \theta}}h(\bm{\theta}_{t}),\quad t=1,\ldots,T,

(3)

where $\gamma$ is the learning rate and $T$ is the iteration number. Unfortunately, this yields a probability distribution $p(\mathbf{s}|\bm{\theta})$ over the configuration space $\{-1,1\}^{n}$ , instead of a deterministic binary configuration $\mathbf{s}$ as desired. Although we can manually binarize $\bm{\theta}$ through $\mathbf{B}(\bm{\theta}):=\mathrm{sgn}(\bm{\theta}-0.5)$ to get the binary configuration which maximizes probability $p(\mathbf{s}|\bm{\theta})$ , the outcome $f(\mathbf{B}(\bm{\theta}))$ may be much higher than $h(\bm{\theta})$ , resulting in significant performance degradation [27]. This suggests that a good gradient-based optimizer should efficiently diminish the uncertainty in the output distribution $p(\mathbf{s}|\bm{\theta})$ , which can be measured by its total variance

\displaystyle V(\bm{\theta})=\sum_{i=1}^{n}\theta_{i}(1-\theta_{i}).

(4)

2.1 Monte Carlo gradient estimation

Conventionally, we can solve the problem Eq. (2) by approximating the gradient of $h(\bm{\theta})$ via Monte Carlo gradient estimation (MCGE) [28] (Alg. 2, Appendix), in which we estimate the gradient in Eq. (3) as

\displaystyle\begin{aligned} \bigtriangledown_{\bm{\theta}}h(\bm{\theta})=% \mathbb{E}_{p(\mathbf{s}|\bm{\theta})}[f(\mathbf{s})\bigtriangledown_{\bm{% \theta}}\log p(\mathbf{s}|\bm{\theta})]\approx\frac{1}{M}\sum_{m=1}^{M}f(% \mathbf{s}^{(m)})\bigtriangledown_{\bm{\theta}}\log p(\mathbf{s}^{(m)}|\bm{% \theta}),\end{aligned}

(5)

where $\mathbf{s}^{(m)}\sim_{i.i.d.}p(\mathbf{s}|\bm{\theta}),\quad m=1,\ldots,M$ . However, it turns out that MCGE performs poorly even equipped with momentum, compared to existing solvers such as simulated annealing and Hopfield neural network, as shown in Fig. 2. We interpret this result as follows. Although MCGE turns the combinatorial optimization problem into a differentiable one, it does not reduce any inherent complexity of the original problem, which may contains a convoluted landscape. As gradient only provides local information, MCGE is susceptible to be trapped in local minimas.

3 Heat diffusion optimization

The inferior performance of the MCGE is attributed to the narrow receptive field of the gradient descent. To overcome this drawback, we manage to provide more efficient navigation to the solver by employing heat diffusion [29], which propagates information from distant region to the solver. Intuitively, consider the parameter space as a thermodynamic system, where each parameter $\bm{\theta}$ is referred to as a location and is associated with an initial temperature value $-h(\bm{\theta})$ , as shown in Fig. 1. Then the optimization procedure can be described as the process that the solver is walking around the parameter space to find the location $\bm{\theta}^{\ast}$ with the highest temperature (or equivalently, the global minima of $h(\bm{\theta})$ ). As time progresses, heat flows obeying the Newton’s law of cooling, leading to an evolution of the temperature distribution across time. The heat at $\bm{\theta}^{\ast}$ flows towards surrounding areas, ultimately reaching the solver’s location. This provides valuable information for the solver, as it can trace the direction of heat flow to locate the $\bm{\theta}^{\ast}$ .

3.1 Heat diffusion on the parameter space

Now we introduce heat diffusion for combinatorial optimization. We extent the parameter space of $\bm{\theta}$ from $[0,1]^{n}$ to $\bar{\mathbb{R}}^{n}$ with $\bar{\mathbb{R}}=\mathbb{R}\cup\{-\infty,+\infty\}$ . To keep the probabilistic distribution $p(\mathbf{s}|\bm{\theta})$ meaningful for $\bm{\theta}\notin[0,1]^{n}$ , we now redefine $p(s_{i}=\pm 1|\bm{\theta})=\mathrm{clamp}(0.5\pm(\theta_{i}-0.5),0,1)$ , where the clamp function is calculated as $\mathrm{clamp}(x,0,1)=\max(0,\min(x,1))$ . Denote the temperature at location $\bm{\theta}$ and time $\tau$ as $u(\tau,\bm{\theta})$ , which is the solution to the following unbounded heat equation [29]

\displaystyle\left\{\begin{matrix}\partial_{\tau}u(\tau,\bm{\theta})&=&\Delta_% {\bm{\theta}}u(\tau,\bm{\theta}),&\quad\tau>0,&\quad\bm{\theta}\in\mathbb{R}^{% n}\\ u(\tau,\bm{\theta})&=&h(\bm{\theta}),&\quad\tau=0,&\quad\bm{\theta}\in\mathbb{% R}^{n}\\ \end{matrix}\right.,

(6)

where $\Delta$ is the Laplacian operator: $\Delta g(\mathbf{x})=\sum_{i=1}^{n}\partial_{x_{i}}g(\mathbf{x})$ . For $\bm{\theta}\in\bar{\mathbb{R}}^{n}/\mathbb{R}^{n}$ , we define $u(\tau,\bm{\theta})=\lim_{\bm{\theta}_{n}\rightarrow\bm{\theta}}u(\tau,\bm{% \theta}_{n})$ , where $\{\bm{\theta}_{n}\}$ is a sequence in $\mathbb{R}^{n}$ converged to $\bm{\theta}$ . Heat equation in the combinatorial optimization exhibits two beneficial characteristics. Firstly, the propagation speed of heat is infinite [22], implying that the information can reach the solver instantaneously. Secondly, the location of the global minima does not change across time $\tau$ , as demonstrated in the following theorem.

Theorem 1.

For any $\tau>0$ , the function $u(\tau,\bm{\theta})$ and $h(\bm{\theta})$ has the same global minima in $\bar{\mathbb{R}}^{n}$

\displaystyle{\arg\min}_{\bm{\theta}\in\bar{\mathbb{R}}^{n}}u(\tau,\bm{\theta}% )={\arg\min}_{\bm{\theta}\in\bar{\mathbb{R}}^{n}}h(\bm{\theta})

(7)

Consequentially, we can generalize the gradient descent approach Eq. (3) by substituting the function $h(\bm{\theta})$ with $u(\tau_{t},\bm{\theta})$ for different $\tau_{t}>0$ at different iteration step $t$ in Eq. (3), as follows

\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\gamma\bigtriangledown_{\bm{% \theta}}u(\tau_{t},\bm{\theta}_{t}),

(8)

where the subscript ’_t’ in $\tau_{t}$ means that $\tau_{t}$ can vary across different steps. In this way, the solver can receive the gradient information about distant region of the landscape that is propagated by the heat diffusion, resulting in a more efficient navigation. However, Eq. (8) will converge to $\check{\theta}_{i}=\begin{cases}+\infty,\quad s_{i}^{\ast}=+1\\ -\infty,\quad s_{i}^{\ast}=-1\end{cases}$ , since $\bm{\theta}_{t}$ are unbounded. To make the procedure Eq. (8) practicable, we project the $\bm{\theta}_{t}$ back to $\mathcal{I}$ after gradient descent at each iteration

\displaystyle\bm{\theta}_{t+1}=\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{% t}-\gamma\bigtriangledown_{\bm{\theta}}u(\tau_{t},\bm{\theta}_{t})\big{)},

(9)

so that $\bm{\theta}_{t}\in\mathcal{I}$ always holds, where we define the projection as $\mathrm{Proj}_{\mathcal{I}}(\mathbf{x})_{i}=\min(1,\max(0,x_{i}))$ for $i=1,\ldots,n$ and for $\mathbf{x}\in\mathbb{R}^{n}$ . Eq. (9) is a reasonable update rule for finding the minimum $\bm{\theta}^{\ast}$ within $\mathcal{I}$ for the following two reasons: (1) The projection of the $\check{\bm{\theta}}$ is the minimum of $h(\bm{\theta})$ in $\mathcal{I}$ , i.e., $\mathrm{Proj}_{\mathcal{I}}\big{(}\check{\bm{\theta}}\big{)}=\bm{\theta}^{\ast}$ ; (2) Due to the property of the projection, if the solver moves towards $\check{\bm{\theta}}$ , it also gets closer to $\bm{\theta}^{\ast}$ . More importantly, since the coordinates of $\check{\bm{\theta}}$ are all infinite, the convergent point of Eq. (9) must be one of the vertices of $\mathcal{I}$ , i.e., $\{0,1\}^{n}$ . This suggests that Eq. (9) tends to give an output $\bm{\theta}$ that diminishes the uncertainty $V(\bm{\theta})$ (Eq. (4)).

3.2 Solving the heat equation

To develop an algorithm for solving the combinatorial optimization problem from Eq. (9), we must solve the heat equation Eq. (6), which seems a significant challenge when the dimension $n$ is high. Fortunately, the solution has a closed form if the target function $f(\mathbf{s})$ can be written as a multi-linear polynomial of $\mathbf{s}$

\displaystyle\begin{aligned} f(\mathbf{s})=a_{0}+\sum_{i_{1}}a_{1,i_{1}}s_{i_{% 1}}+\sum_{i_{1}<i_{2}}a_{2,i_{1}i_{2}}s_{i_{1}}s_{1_{2}}+\cdots+\sum_{i_{1}<% \cdots<i_{K}}a_{K,i_{1}\ldots i_{K}}s_{i_{1}}\cdots s_{i_{K}},\end{aligned}

(10)

a condition met by a wide range of combinatorial optimization problems [16].

Theorem 2.

Supposed that $f(\mathbf{s})$ is a multilinear polynomial of $\mathbf{s}$ , then the solution to Eq. (6) is

\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[f(\mathrm{erf}(% \frac{\bm{\theta}-\mathbf{x}}{\sqrt{\tau}}))],\quad\mathbf{x}\in\mathrm{Unif}[% 0,1]^{n},

(11)

where $\mathrm{erf}(x)=\frac{2}{\sqrt{\pi}}\int_{0}^{x}e^{-t^{2}}dt$ is the error function.

For more general cases, we still use the approximation Eq. (11).

3.3 Proposed algorithm

Based on Eq. (9) and Thm. 2, we proposed Heat diffusion optimization (HeO), a gradient-based algorithm for combinatorial optimization, as illustrated in Alg. 1, where we estimate Eq. (11) with one sample $\mathbf{x}\sim\mathrm{Unif}[0,1]^{n}$ , and we denote $\sqrt{\tau}_{t}=\sigma_{t}$ for short. Our HeO can be equipped with momentum, which is shown in Alg. 3 in Appendix. In contrast to those methods designed for solving special class of PBO (Eq. (1)) such as quadratic unconstrained binary optimization (QUBO), our HeO can directly solve PBO problems with general form. Although PBO can be represented as QUBO [30], this necessitates the introduction of auxiliary variables, which may consequently increase the problem size and leading to additional computational overhead [31]. Compared to other algorithms, our HeO has relatively low complexity. The most computationally intensive operation at each step is gradient calculation, which can be explicitly expressed or efficiently computed with tools like PyTorch’s autograd, and can be accelerated using GPUs. As shown in Fig. S1 of the in Appendix, the time cost per iteration of our methods increases linearly with the problem dimension, with a small constant coefficient. Since the overall computational time mainly depends on the number of iterations times the time cost per iteration, our HeO is efficient even in high-dimensional cases.

Algorithm 1 Heat diffusion optimization (HeO)

Input: target function

f(\cdot)

, step size

\gamma

\sigma

schedule

\{\sigma_{t}\}

, iteration number

T

initialize elements of

\bm{\theta}_{0}

0.5

for

t=0

T-1

sample

\mathbf{x}_{t}\sim\mathrm{Unif}[0,1]^{n}

\mathbf{g}_{t}\leftarrow\bigtriangledown_{\bm{\theta}}f(\mathrm{erf}(\frac{\bm% {\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}))

\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \gamma\mathbf{g}_{t}\big{)}

end for

Output: binary configuration

\mathbf{s}_{T}=\mathrm{sgn}(\bm{\theta}_{T}-0.5)

One counter-intuitive thing is that to minimize the target function $h(\bm{\theta})$ , the HeO actually are minimizing different functions $u(\tau_{t},\bm{\theta})$ at different step $t$ . We interpret this by providing an upper bound for the target optimization loss $h(\bm{\theta})-h(\bm{\theta}^{\ast})$ .

Theorem 3.

Denote $\breve{f}=\max_{\mathbf{s}}f(\mathbf{s})$ . Given $\tau_{2}>0$ and $\epsilon>0$ , there exists $\tau_{1}\in(0,\tau_{2})$ , such that

\displaystyle\begin{aligned} h(\bm{\theta})-h(\bm{\theta}^{\ast})\leq\big{[}(% \breve{f}-f^{\ast})\big{(}u(\tau_{2},\bm{\theta}^{\ast})-u(\tau_{2},\bm{\theta% })+\frac{n}{2}\int_{\tau_{1}}^{\tau_{2}}\frac{u(\tau,\bm{\theta})-u(\tau,\bm{% \theta}^{\ast})}{\tau}d\tau\big{)}\big{]}^{1/2}+\epsilon.\end{aligned}

(12)

Accordingly, minimizing $u(\tau,\bm{\theta})$ for each $\tau$ cooperatively aids in minimizing the original target function $h(\bm{\theta})$ . Thus, we refer to HeO as a cooperative optimization paradigm, as illustrated in Fig. 1.

4 Experiments

We apply our HeO to a variety of NP-hard combinatorial optimization problems to demonstrate its broad applicability. Unless explicitly stated otherwise, we employ the $\tau_{t}$ schedule as $\sqrt{\tau_{t}}=:\sigma_{t}=2(1-t/T)$ for HeO throughout this work. The sensitivity of other parameter settings including the step size $\gamma$ and iterations $T$ are shown in Fig. S2 This choice is motivated by the idea that the reversed direction of heat flow guides the solver towards the original of its source, i.e., the global minima. Noticed that this choice is not theoretically necessary, as elaborated in Discussion.

Toy example. We consider the following target function

\displaystyle f(\mathbf{s})=\mathbf{a}_{2}^{\mathrm{T}}\mathrm{sigmoid}(W% \mathbf{s}+\mathbf{a}_{1})

(13)

where $\mathrm{sigmoid}(x)=\tfrac{1}{1+e^{-x}}$ , and the elements of the network parameters $\mathbf{a}_{1}\in\mathbb{R}^{n}$ , $\mathbf{a}_{2}\in\mathbb{R}^{m}$ , $W\in\mathbb{R}^{m,n}$ are uniformly sampled from $[-1,1]$ and fixed during optimizing. According to the universal approximation theory [32], $f(\mathbf{s})$ can approximate any continuous function with sufficiently large $m$ , thereby representing a general target function. We compare the performance of our HeO with momentum (Alg. 3) against several representative methods: the conventional gradient-based solver MCGE [28] (with or without momentum), the simulated annealing [33], and the Hopfield neural network [34]. As shown in Fig. 2, our HeO demonstrates exceptional superiority over all other methods, and efficiently reduces its uncertainty compared to MCGE.

Quadratic unconstrained binary optimization (QUBO). QUBO is the combinatorial optimization problem with quadratic form target function ( $J\in\mathbb{R}^{n\times n}$ is a symmetric matrix with zero diagonals)

\displaystyle f(\mathbf{s})=\mathbf{s}^{\mathrm{T}}J\mathbf{s}.

(14)

This corresponds to the case where $K=2$ in Eq. (10). A well-known class of QUBO is max-cut problem [27], in which we divide the vertices of a graph into two distinct subsets and aim to maximize the number of edges between them. Its target function is expressed as Eq. (14), where $J$ is determined by the adjacency matrix of the graph.

We compare our HeO with representative iterative approximation methods especially developed for solving QUBO including LQA [10], aSB [12], bSB [13], dSB [13], CIM [35], and SIM-CIM [15] on max-cut problems in the Biq Mac Library [36]¹¹1https://biqmac.aau.at/biqmaclib.html. We report the relative loss averaged over all instances and the count of the instances where each algorithm gives the bottom-2 worse result among the 7 algorithms. As shown in Fig. 3, our HeO is superior to other methods in terms of two metrics.

Polynomial unconstrained binary optimization (PUBO). PUBO is a class of combinatorial optimization problems, in which higher-order interactions between bits $s_{i}$ appears in the target function. Existing methods for solving PUBO fall into two categories: the first approach involves transforming PUBO into QUBO by adding auxiliary variables through a quadratization process, and then solving it as a QUBO problem [38], and the one directly solves PUBO [16]. Quadratization may dramatically increases the dimension of the problem, hence brings heavier computational overhead, while our HeO can be directly used for solving PUBO. A well-known class of PUBO is the Boolean $3$ -satisfiability ( $3$ -SAT) problem [27], which involves determining the satisfiability of a Boolean formula over $n$ Boolean variables $b_{1},\ldots,b_{n}$ where $b_{i}\in\{0,1\}$ . The Boolean formula is structured in Conjunctive Normal Form (CNF) consisting of $H$ conjunction ( $\wedge$ ) of clauses, and each clause $h$ is a disjunction ( $\vee$ ) of exactly three literals (either a Boolean variable or its negation). An algorithm of $3$ -SAT aims to find the Boolean variables that makes as many as clauses satisfied.

To apply our HeO to the 3-SAT, we encode each Boolean variable $b_{i}$ as $s_{i}$ , which is assigned with value $1$ if $b_{i}=1$ , otherwise $s_{i}=-1$ . For a literal, we define a value $c_{h_{i}}$ , which is $-1$ if the literal is the negation of the corresponding Boolean variable, otherwise it is $1$ . Then finding the Boolean variables that satisfies as many as clauses is equivalent to minimize the target function

\displaystyle f(\mathbf{s})=\sum_{h=1}^{H}\prod_{i=1}^{3}\frac{1-c_{h_{i}}s_{h% _{i}}}{2}.

(15)

This corresponds to the case where $K=3$ in Eq. (10). We compared our HeO (Alg. 4, Appendix) with the second-order oscillator Ising machines (OIM) solver that using quadratization and the state-of-art 3-order OIM proposed in [16] on 3-SAT instance in SATLIB ²²2https://www.cs.ubc.ca/~hoos/SATLIB/benchm.html. As shown in Fig. 4, our HeO is superior to other methods in attaining higher quality of solutions and finding more the complete satisfiable solution (solutions that satisfying all clauses). Notably, for the cases of 175-250 variables, our HeO is able to find more complete satisfiable solutions, compared to the 3-order OIM, while the 2-order OIM fails to find any complete solutions [16].

Ternary optimization. Neural networks excel in learning and modeling complex functions, but they also bring about considerable computational demands due to their vast number of parameters. A promising strategy to mitigate this issue is quantization, which converts network parameters into discrete values [39]. However, directly training networks with discrete parameters introduces a significant challenge due to the high-dimensional combinatorial optimization problem it presents.

We apply our HeO and MCGE to directly train neural networks with ternary value ( ${-1,0,1}$ ). Supposed that we have an input-output dataset $\mathcal{D}$ generated by a ground-truth ternary single-layer perceptron $\mathbf{y}=\bm{\Gamma}(\mathbf{v};W_{\mathrm{GT}})=\mathrm{Relu}(W_{\mathrm{GT% }}\mathbf{v})$ , where $\mathrm{Relu}(x)=\max\{0,x\}$ , $W_{\mathrm{GT}}\in\{-1,0,1\}^{m\times n}$ is the ground-truth ternary weight parameter, $\mathbf{v}\in\{-1,0,1\}^{n}$ is the input, and $\mathbf{y}$ is the model output. We aim to find the ternary configuration $W\in\{-1,0,1\}^{m\times n}$ minimizing the loss $\mathrm{MSE}(W,\mathcal{D})=\tfrac{1}{\left|\mathcal{D}\right|}\sum_{(\mathbf{% v},\mathbf{y})\in\mathcal{D}}\left\|\bm{\Gamma}(\mathbf{v};W)-\mathbf{y}\right% \|^{2}$ . We generalize our HeO from the binary to the ternary case by representing a ternary variable with two bits, where each element of $W$ can be represented as a function of $\mathbf{s}$ (see Alg. 5, Appendix for details), and the target function is defined as

\displaystyle f(\mathbf{s})=\frac{1}{\left|\mathcal{D}\right|}\sum_{(\mathbf{v% },\mathbf{y})\in\mathcal{D}}\left\|\bm{\Gamma}(\mathbf{v};W(\mathbf{s}))-% \mathbf{y}\right\|^{2}.

(16)

As shown in Fig. 5, HeO robustly exceeds MCGE under different dataset size $|\mathcal{D}|$ and output size $m$ .

Mixed combinatorial optimization. In high-dimensional linear regression, usually only a small fraction of the variables significantly contribute to prediction. Identifying and selecting a subset of variables with strong predictive power—a process known as variable selection—is crucial, as it improves the generalizability and interpretability of the regression model [42]. However, direct variable selection is an NP-hard combinatorial optimization mixed with continuous variables [43]. As a practical alternative, regularization methods like Lasso algorithm are commonly employed [40].

Supposed a dataset is generated from a linear model, in which the relation between input $\mathbf{v}$ and output $y$ is $y=\bm{\beta}^{\ast}\cdot\mathbf{v}+\epsilon$ , where $\bm{\beta}^{\ast}$ is the ground-truth linear coefficient and $\epsilon$ is independent Gaussian noise with standard deviation $\sigma_{\epsilon}$ . Suppose that only a small proportion (denoted as $q\in(0,1)$ ) of coordinates in $\beta_{i}^{\ast}$ are non-zero. Our goal is to identify these coordinates through an indicator vector $\mathbf{s}\in\{-1,1\}^{n}$ ( $1$ for selection and $-1$ for non-selection) and estimate these non-zero coefficients. The target function of the problem is ( $\mathbf{1}\in\mathbb{R}^{n}$ is the all-one vector)

\displaystyle f(\mathbf{s},\bm{\beta})=\frac{1}{\left|\mathcal{D}\right|}\sum_% {(\mathbf{v},\mathbf{y})\in\mathcal{D}}\left|\big{(}\bm{\beta}\odot\frac{% \mathbf{s}+\mathbf{1}}{2}\big{)}\cdot\mathbf{v}-\mathbf{y}\right|^{2}.

(17)

We solve the problem through HeO (Alg. 6, Appendix), where we minimize the loss relative to $\bm{\theta}$ while slowing varying $\bm{\beta}$ via its error gradient. After obtaining the indicator $\mathbf{s}$ , we conduct an ordinary least squares regression on the variables selected by the $\mathbf{s}$ to estimate their coefficients, and treat other variables’ coefficients as zero. As shown in Fig.6, our HeO outperforms both Lasso regression and the more advanced $L_{0.5}$ regression [41] in terms of producing more accurate indicators $\mathbf{s}$ and achieving lower test prediction errors across various $q$ and $\sigma_{\epsilon}$ settings. Importantly, to give the variable selection prediction, our HeO does not need to know the level of $q$ and $\sigma_{\epsilon}$ in advance.

Table 1: Attributes of real world datasets and the vertex cover sizes of HeO (Alg. 7, Appendix) and FastVC.

Graph name	Vertex number	Edge number	FastVC	HeO
tech-RL-caida	190914	607610	78306	77372 (17)
soc-youtube	495957	1936748	149458	148875 (25)
inf-roadNet-PA	1087562	1541514	588306	587401 (104)
inf-roadNet-CA	1957027	2760388	1063352	1061339 (32)
socfb-B-anon	2937612	20959854	338724	312531 (194)
socfb-A-anon	3097165	23667394	421123	387730 (355)
socfb-uci-uni	58790782	92208195	869457	867863 (36)

Constrained binary optimization.

Minimum vertex cover (MVC) is a class of the constrained combinatorial optimization which has wide applications [44], as illustrated in Fig. 7. Given an undirected graph $\mathcal{G}$ with vertex set $\mathcal{V}$ and edge set $\mathcal{E}$ , the MVC is to find the minimum subset $\mathcal{V}_{c}\subset\mathcal{V}$ , so that for each edge $e\in\mathcal{E}$ , at least one of its endpoints belongs to $\mathcal{V}_{c}$ . The target function and the constrains are expressed as

\displaystyle f(\mathbf{s})=\sum_{i=1}^{n}\frac{s_{i}+1}{2},\quad\text{subject% to }g_{ij}(\mathbf{s})=(1-\frac{s_{i}+1}{2})(1-\frac{s_{j}+1}{2})=0,\quad% \forall i,j,e_{ij}\in\mathcal{E}.

(18)

We combine the HeO with penalty function (Alg. 7, Appendix) for solving MVC. We compare our HeO with FastVC [45], a powerful MVC heuristic algorithm, on massive real world graph datasets ³³3http://networkrepository.com/. For a fair comparison, we keep the run time of two algorithms as the same for each dataset. As shown in Tab. 1, our HeO can find smaller cover sets than that of FastVC.

5 Discussion

Existing model-based combinatorial optimization approaches encode the solution space via a parameterized distribution with iterative parameter updates [46]. In contrast to HeO, which requires only one sample per iteration, they necessitate a large number of samples per iteration. The Gibbs-With-Gradient algorithm [47] uses gradient information for combinatorial optimization but searches in the discrete solution space instead of the continuous one, as did HeO. Denoising diffusion model (DDM) [24] has been applied for solving combinatorial optimization problems [48]. Although the diffusion process in DDMs akin to the heat diffusion in our HeO, DDMs require a substantial data for training and necessitate reversing the diffusion process to generate data that from the target distribution. In contrast, HeO needs no training, and it is unnecessary to strictly adhering to the reverse time sequence $\tau_{t}$ in the optimization process, as under different $\tau$ , the function $u(\tau,\bm{\theta})$ shares the same optima with that of the original problem $h(\bm{\theta})$ . This claim is empirically corroborated in Fig. S3, Appendix, where HeO applying non-monotonic schedules of $\tau_{t}$ still demonstrates superior performance.

Our HeO can be viewed as a stochastic Gaussian continuation (GC) method [49] with projection, which has been applied for non-convex optimization, though it has not yet been used for combinatorial optimization. The optimal convexification of GC [50] underpins potentially theoretical advantages of HeO. One key distinction is that GC typically optimizes each sub-problem (corresponding to $u(\bm{\theta},\tau_{t})$ ) at each $t$ up to some criteria, whereas HeO merely performs a single-step gradient descent. Also, Eq. (8) corresponds to a variation of the evolution strategy (ES), a robust optimizer for non-differentiable function [51], while HeO use a different gradient estimation (Eq. (11)), see Appendix for details. Additionally, our HeO is related to randomized smoothing, which has been applied to non-smooth optimization [52] or neural network regularization [53]. The distinctive feature of our HeO is that, across different $\tau$ , the smoothed function $u(\tau,\bm{\theta})$ retains the optima of the original function $h(\bm{\theta})$ (Thm. 1). This distinguishes our HeO from methods based on quantum adiabatic theory [9], bifurcation theory [12] and other relaxation strategies [54], in which the optima of the smoothed function can be different from the original one [27, 55]. This is empirically verified in Fig. S3, Appendix.

The heat equation in our HeO can be naturally extended to general parabolic differential equations, given that a broad spectrum of them obey the backward uniqueness [56]. For example, we can use $\partial_{\tau}u(\tau,\bm{\theta})=\bigtriangledown_{\bm{\theta}}[A% \bigtriangledown_{\bm{\theta}}u(\tau,\bm{\theta})]$ to replace the Eq. (6), where $A$ is a real positive definite matrix. Prior researches have demonstrated that the optimization procedure can be significantly accelerated by preconditioning [57] or Fisher information matrix [58], implying that choosing a proper matrix $A$ could substantially improve the efficacy of the HeO. Additionally, since $\tau_{t}$ does not necessarily have to be monotonic due to the cooperative optimization property of HeO, it is feasible to explore various $\tau_{t}$ schedules (possibly non-monotonic) to further enhance performance. Moreover, our HeO can be integrated with other existing techniques for combinatorial optimization problems with constraints, such as Augmented Lagrangian methods, to achieve better performance [59].

Despite the effectiveness of HeO on the combinatorial optimization problems from different domains we have considered, it has limitations. First, current HeO is inefficient for integer linear programming and routing problems, primarily due to that it is cumbersome to encode integer variables through the Bernoulli distribution in our framework. Nevertheless, integrating HeO with other techniques such as advanced Metropolis-Hastings algorithm [60] may path to broaden the applicability of our methodology to a wider range of combinatorial optimization problems. Besides, our HeO allows for further customization by incorporating additional terms that integrate problem-specific prior knowledge or by hybridizing with other metaheuristic algorithms, allowing for more effective exploration of the configuration space. Second, our HeO can not be theoretically guaranteed for converging to the global minimum. In general, finding the global minimum is not theoretically guaranteed for non-convex optimization problems [61], such as the combinatorial optimization problems studied in this paper. However, it can be demonstrated that the gradient of the target function under heat diffusion satisfies the inequality [22]:

\displaystyle\left|\bigtriangledown_{\bm{\theta}}u(\tau,\bm{\theta})\right|% \leq\frac{C}{\sqrt{\tau}},

where the constant $C$ depends on the dimension. This implies that the target function becomes weakly convex, enabling the finding of global minima and faster convergence under certain conditions [62].

6 Conclusion

In conclusion, grounded in the heat diffusion, we present a framework called Heat diffusion optimization (HeO) to solve various combinatorial optimization problems. The heat diffusion facilitates the propagation of information from distant regions to the solver, expanding its receptive field, which in turn enhances its ability to search for global optima. Demonstrating exceptional performance across various scenarios, our HeO highlights the potential of utilizing heat diffusion to address challenges associated with navigating the solution space of combinatorial optimization.

References

[1] Francisco Barahona, Martin Grötschel, Michael Jünger, and Gerhard Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988.
[2] Jun Wang, Tony Jebara, and Shih-Fu Chang. Semi-supervised learning using greedy max-cut. The Journal of Machine Learning Research, 14(1):771–800, 2013.
[3] Chetan Arora, Subhashis Banerjee, Prem Kalra, and SN Maheshwari. An efficient graph cut algorithm for computer vision problems. In Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part III 11, pages 552–565. Springer, 2010.
[4] Hayato Ushijima-Mwesigwa, Christian FA Negre, and Susan M Mniszewski. Graph partitioning using quantum annealing on the d-wave system. In Proceedings of the Second International Workshop on Post Moores Era Supercomputing, pages 22–29, 2017.
[5] Florian Neukart, Gabriele Compostella, Christian Seidel, David Von Dollen, Sheir Yarkoni, and Bob Parney. Traffic flow optimization using a quantum annealer. Frontiers in ICT, 4:29, 2017.
[6] Román Orús, Samuel Mugel, and Enrique Lizaso. Quantum computing for finance: Overview and prospects. Reviews in Physics, 4:100028, 2019.
[7] Rafael Marti and Gerhard Reinelt. Exact and Heuristic Methods in Combinatorial Optimization, volume 175. Springer, 2022.
[8] Edward Farhi, Jeffrey Goldstone, Sam Gutmann, Joshua Lapan, Andrew Lundgren, and Daniel Preda. A quantum adiabatic evolution algorithm applied to random instances of an np-complete problem. Science, 292(5516):472–475, 2001.
[9] John A Smolin and Graeme Smith. Classical signature of quantum annealing. Frontiers in physics, 2:52, 2014.
[10] Joseph Bowles, Alexandre Dauphin, Patrick Huembeli, José Martinez, and Antonio Acín. Quadratic unconstrained binary optimization via quantum-inspired annealing. Physical Review Applied, 18(3):034016, 2022.
[11] Mária Ercsey-Ravasz and Zoltán Toroczkai. Optimization hardness as transient chaos in an analog approach to constraint satisfaction. Nature Physics, 7(12):966–970, 2011.
[12] Hayato Goto, Kosuke Tatsumura, and Alexander R Dixon. Combinatorial optimization by simulating adiabatic bifurcations in nonlinear hamiltonian systems. Science advances, 5(4):eaav2372, 2019.
[13] Hayato Goto, Kotaro Endo, Masaru Suzuki, Yoshisato Sakai, Taro Kanao, Yohei Hamakawa, Ryo Hidaka, Masaya Yamasaki, and Kosuke Tatsumura. High-performance combinatorial optimization based on classical mechanics. Science Advances, 7(6):eabe7953, 2021.
[14] Takahiro Inagaki, Yoshitaka Haribara, Koji Igarashi, Tomohiro Sonobe, Shuhei Tamate, Toshimori Honjo, Alireza Marandi, Peter L McMahon, Takeshi Umeki, Koji Enbutsu, et al. A coherent ising machine for 2000-node optimization problems. Science, 354(6312):603–606, 2016.
[15] Egor S Tiunov, Alexander E Ulanov, and AI Lvovsky. Annealing by simulating the coherent ising machine. Optics express, 27(7):10288–10295, 2019.
[16] Connor Bybee, Denis Kleyko, Dmitri E Nikonov, Amir Khosrowshahi, Bruno A Olshausen, and Friedrich T Sommer. Efficient optimization with higher-order ising machines. Nature Communications, 14(1):6033, 2023.
[17] Martin JA Schuetz, J Kyle Brubaker, and Helmut G Katzgraber. Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 4(4):367–377, 2022.
[18] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, pages 1–3, 2023.
[19] David Pisinger and Stefan Ropke. Large neighborhood search. Handbook of metaheuristics, pages 99–127, 2019.
[20] Pierre Hansen, Nenad Mladenović, and Jose A Moreno Perez. Variable neighbourhood search: methods and applications. Annals of Operations Research, 175:367–407, 2010.
[21] Haoran Sun, Hanjun Dai, Wei Xia, and Arun Ramamurthy. Path auxiliary proposal for mcmc in discrete space. In International Conference on Learning Representations, 2021.
[22] Lawrence C Evans. Partial differential equations, volume 19. American Mathematical Society, 2022.
[23] Jean-Michel Ghidaglia. Some backward uniqueness results. Nonlinear Analysis: Theory, Methods & Applications, 10(8):777–790, 1986.
[24] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
[25] Endre Boros and Peter L Hammer. Pseudo-boolean optimization. Discrete applied mathematics, 123(1-3):155–225, 2002.
[26] Yves Crama and Peter L Hammer. Boolean functions: Theory, algorithms, and applications. Cambridge University Press, 2011.
[27] Bernhard H Korte, Jens Vygen, B Korte, and J Vygen. Combinatorial optimization, volume 1. Springer, 2011.
[28] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte carlo gradient estimation in machine learning. The Journal of Machine Learning Research, 21(1):5183–5244, 2020.
[29] David Vernon Widder. The heat equation, volume 67. Academic Press, 1976.
[30] Avradip Mandal, Arnab Roy, Sarvagya Upadhyay, and Hayato Ushijima-Mwesigwa. Compressed quadratization of higher order binary optimization problems. In Proceedings of the 17th ACM International Conference on Computing Frontiers, pages 126–131, 2020.
[31] Martin Anthony, Endre Boros, Yves Crama, and Aritanan Gruber. Quadratic reformulations of nonlinear binary optimization problems. Mathematical Programming, 162:115–144, 2017.
[32] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
[33] Michel Gendreau, Jean-Yves Potvin, et al. Handbook of metaheuristics, volume 2. Springer, 2010.
[34] John J Hopfield and David W Tank. “neural” computation of decisions in optimization problems. Biological cybernetics, 52(3):141–152, 1985.
[35] Zhe Wang, Alireza Marandi, Kai Wen, Robert L Byer, and Yoshihisa Yamamoto. Coherent ising machine based on degenerate optical parametric oscillators. Physical Review A, 88(6):063853, 2013.
[36] Angelika Wiegele. Biq mac library—a collection of max-cut and quadratic 0-1 programming instances of medium size. Preprint, 51, 2007.
[37] Holger H Hoos and Thomas Stützle. Satlib: An online resource for research on sat. Sat, 2000:283–292, 2000.
[38] Andrew Lucas. Ising formulations of many np problems. Frontiers in physics, 2:5, 2014.
[39] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
[40] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
[41] Zongben Xu, Hai Zhang, Yao Wang, XiangYu Chang, and Yong Liang. L 1/2 regularization. Science China Information Sciences, 53:1159–1169, 2010.
[42] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
[43] Xiaoping Li, Yadi Wang, and Rubén Ruiz. A survey on sparse learning models for feature selection. IEEE transactions on cybernetics, 52(3):1642–1660, 2020.
[44] Alexander C Reis, Sean M Halper, Grace E Vezeau, Daniel P Cetnar, Ayaan Hossain, Phillip R Clauer, and Howard M Salis. Simultaneous repression of multiple bacterial genes using nonrepetitive extra-long sgrna arrays. Nature biotechnology, 37(11):1294–1301, 2019.
[45] Shaowei Cai, Jinkun Lin, and Chuan Luo. Finding a small vertex cover in massive sparse graphs: Construct, local search, and preprocess. Journal of Artificial Intelligence Research, 59:463–494, 2017.
[46] Enlu Zhou and Jiaqiao Hu. Gradient-based adaptive stochastic search for non-differentiable optimization. IEEE Transactions on Automatic Control, 59(7):1818–1832, 2014.
[47] Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops i took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pages 3831–3841. PMLR, 2021.
[48] Zhiqing Sun and Yiming Yang. Difusco: Graph-based diffusion solvers for combinatorial optimization. Advances in neural information processing systems, 2023.
[49] Hossein Mobahi and John Fisher III. A theoretical analysis of optimization by gaussian continuation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[50] Hossein Mobahi and John W Fisher. On the link between gaussian homotopy continuation and convex envelopes. In Energy Minimization Methods in Computer Vision and Pattern Recognition: 10th International Conference, EMMCVPR 2015, Hong Kong, China, January 13-16, 2015. Proceedings 10, pages 43–56. Springer, 2015.
[51] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(27):949–980, 2014.
[52] John C Duchi, Peter L Bartlett, and Martin J Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
[53] Mo Zhou, Tianyi Liu, Yan Li, Dachao Lin, Enlu Zhou, and Tuo Zhao. Toward understanding the importance of noise in training neural networks. In International Conference on Machine Learning, pages 7594–7602. PMLR, 2019.
[54] Nikolaos Karalias and Andreas Loukas. Erdos goes neural: an unsupervised learning framework for combinatorial optimization on graphs. Advances in Neural Information Processing Systems, 33:6659–6672, 2020.
[55] Haoyu Peter Wang, Nan Wu, Hang Yang, Cong Hao, and Pan Li. Unsupervised learning for combinatorial optimization with principled objective relaxation. Advances in Neural Information Processing Systems, 35:31444–31458, 2022.
[56] Jie Wu and Liqun Zhang. Backward uniqueness for general parabolic operators in the whole space. Calculus of Variations and Partial Differential Equations, 58:1–19, 2019.
[57] Hengyuan Ma, Li Zhang, Xiatian Zhu, and Jianfeng Feng. Accelerating score-based generative models with preconditioned diffusion sampling. In European Conference on Computer Vision, pages 1–16. Springer, 2022.
[58] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
[59] Ernesto G Birgin and José Mario Martínez. Practical augmented Lagrangian methods for constrained optimization. SIAM, 2014.
[60] Haoran Sun, Katayoon Goshvadi, Azade Nova, Dale Schuurmans, and Hanjun Dai. Revisiting sampling for combinatorial optimization. In International Conference on Machine Learning, pages 32859–32874. PMLR, 2023.
[61] Feng-Yi Liao, Lijun Ding, and Yang Zheng. Error bounds, pl condition, and quadratic growth for weakly convex functions, and linear convergences of proximal point methods. In 6th Annual Learning for Dynamics & Control Conference, pages 993–1005. PMLR, 2024.
[62] Felipe Atenas, Claudia Sagastizábal, Paulo JS Silva, and Mikhail Solodov. A unified analysis of descent sequences in weakly convex optimization, including convergence rates for bundle methods. SIAM Journal on Optimization, 33(1):89–115, 2023.
[63] Paul Bertens and Seong-Whan Lee. Network of evolvable neural units: Evolving to learn at a synaptic level. arXiv preprint, 2019.
[64] Juntao Wang, Daniel Ebler, KY Michael Wong, David Shui Wing Hui, and Jie Sun. Bifurcation behaviors shape how continuous physical dynamics solves discrete ising optimization. Nature Communications, 14(1):2510, 2023.

Supplementary Information

1 Proof of theorems

1.1 Proof of Thm. 1

To prove the Thm. 1, we recall the backward uniqueness of the heat equation [23], which asserts that the initial state of a heat equation can be uniquely determined by its state at a time point $\tau$ under mild conditions, as shown in the following theorem.

Theorem S4.

Given two bounded function $h_{1}(\mathbf{x}),h_{2}(\mathbf{x})$ with domain on $\mathbb{R}^{n}$ . Denote $u_{1}(\tau,\mathbf{x})$ and $u_{2}(\tau,\mathbf{x})$ as the solutions to the heat equation (Eq. (6)) with initial condition $u_{1}(0,\mathbf{x})=h_{1}(\mathbf{x})$ and $u_{2}(0,\mathbf{x})=h_{2}(\mathbf{x})$ respectively. If there exists a $\tau>0$ , such that

\displaystyle u_{1}(\tau,\mathbf{x})=u_{2}(\tau,\mathbf{x}),\quad\forall% \mathbf{x}\in\mathbb{R}^{n},

(S1)

we have $h_{1}=h_{2}$ .

Proof of the Thm. 1.

Proof.

We first show that the global minima of $u(\tau,\bm{\theta})$ is also the global minima of $h(\bm{\theta})$ . The cornerstone of the proof is Thm. S4. To utilize the backward uniqueness, we consider a reparameterization of $p(\mathbf{s}|\bm{\theta})$ by introducing a random variable $\mathbf{x}\sim\mathrm{Unif}(0,1)^{n}$ . It is easy to see that $\mathrm{sgn}(\theta_{i}-x_{i})$ obeys Bernoulli distribution with probability $\theta_{i}$ to be $1$ and $1-\theta_{i}$ to be $-1$ . Replacing $s_{i}$ by $x_{i}$ in Eq. (2), we have a new expression of $h(\bm{\theta})$

\displaystyle h(\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[f(\mathrm{sgn}(\bm{% \theta}-\mathbf{x}))],\quad\mathbf{x}\sim\mathrm{Unif}(0,1)^{n}.

(S2)

Since the solution to the heat equation can be represented by the heat kernel, we have [22]

\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{z})}[h(\bm{\theta}+% \sqrt{2\tau}\mathbf{z})],\quad p(\mathbf{z})=\mathcal{N}(\mathbf{0},I).

(S3)

Therefore, we have

\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{z})}[\mathbb{E}_{p(% \mathbf{x})}[f(\mathrm{sgn}(\bm{\theta}+\sqrt{2\tau}\mathbf{z}-\mathbf{x})]]=% \mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(\mathbf{z})}[f(\mathrm{sgn}(\bm{% \theta}-(\mathbf{x}+\sqrt{2\tau}\mathbf{z}))]].

(S4)

Denote $u(\tau,\mathbf{x};\bm{\theta})=\mathbb{E}_{p(\mathbf{z})}[f(\mathrm{sgn}(\bm{% \theta}-(\mathbf{x}+\sqrt{2\tau}\mathbf{z})))]$ , $\mathbf{x}\in(0,1)^{n}$ . Noticed that $u(\tau,\mathbf{x};\bm{\theta})$ is the solution of the following unbounded heat equation respect to the time $\tau$ and location $\mathbf{x}$ restricted on the region $\mathbf{x}\in(0,1)^{n}$

\displaystyle\left\{\begin{matrix}\partial_{\tau}u(\tau,\mathbf{x};\bm{\theta}% )&=&\Delta_{\mathbf{x}}u(\tau,\mathbf{x};\bm{\theta}),&\quad\tau>0,&\quad% \mathbf{x}\in\mathbb{R}^{n}\\ u(\tau,\mathbf{x};\bm{\theta})&=&f(\mathrm{sgn}(\bm{\theta}-\mathbf{x})),&% \quad\tau=0,&\quad\mathbf{x}\in\mathbb{R}^{n}\end{matrix}\right.,

(S5)

hence the latter can be considered as an extension of the former. Since $u(\tau,\mathbf{x};\bm{\theta})$ is analytic respect to $\mathbf{x}\in(0,1)^{n}$ for $\tau>0$ , this extension is unique. Therefore, the value of $u(\tau,\mathbf{x};\bm{\theta})$ on $(0,1)^{n}$ , $\tau>0$ uniquely determines the solution of Eq. (S5). Denote $\check{\bm{\theta}}$ as

\displaystyle\check{\theta}_{i}=\begin{cases}+\infty,\quad s_{i}^{\ast}=+1\\ -\infty,\quad s_{i}^{\ast}=-1.\end{cases}

(S6)

Then $u(\tau,\mathbf{x};\check{\bm{\theta}})=f^{\ast}$ , for any $\tau\geq 0$ and $\mathbf{x}\in\mathbb{R}^{n}$ , and we have

\displaystyle u(\tau,\check{\bm{\theta}})=\mathbb{E}_{p(\mathbf{x})}[u(\tau,% \mathbf{x};\check{\bm{\theta}})]=f^{\ast}.

(S7)

Noticed that for $\bm{\theta}\in\bar{\mathbb{R}}^{n}$ , we have

\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[u(\tau,\mathbf{x}% ;\bm{\theta})]=\mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(\mathbf{z})}[f(\mathrm% {sgn}(\bm{\theta}-(\mathbf{x}+\sqrt{2\tau}\mathbf{z})))]]\leq\mathbb{E}_{p(% \mathbf{x})}[\mathbb{E}_{p(\mathbf{z})}[f^{\ast}]]=f^{\ast},

(S8)

and the equality is true if and only if $u(\tau,\mathbf{x};{\bm{\theta}})=f^{\ast}$ is true for $\mathbf{x}\in\mathbb{R}^{n}$ . Therefore, if $\hat{\bm{\theta}}$ is the one of the minimas of $u(\tau,{\bm{\theta}})$ , and we have $u(\tau,\hat{\bm{\theta}})\geq f^{\ast}$ . Similarly, since

\displaystyle u(\tau,\hat{\bm{\theta}})=\mathbb{E}_{p(\mathbf{x})}[u(\tau,% \mathbf{x};\hat{\bm{\theta}})]=\mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(% \mathbf{z})}[f(\mathrm{sgn}(\hat{\bm{\theta}}-(\mathbf{x}+\sqrt{2\tau}\mathbf{% z})))]]\leq\mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(\mathbf{z})}[f^{\ast}]]=f^% {\ast},

(S9)

we also have $u(\tau,\hat{\bm{\theta}})\leq f^{\ast}$ , hence

\displaystyle u(\tau,\mathbf{x};\hat{\bm{\theta}})=f^{\ast}=u(\tau,\mathbf{x};% {\bm{\theta}}^{\ast}),\quad\mathbf{x}\in\mathbb{R}^{n}.

(S10)

Due to the backward uniqueness of the heat equation, we have

\displaystyle u(0,\mathbf{x};\hat{\bm{\theta}})=u(0,\mathbf{x};{\bm{\theta}}^{% \ast}),\quad\mathbf{x}\in\mathbb{R}^{n},

(S11)

that is

\displaystyle h(\hat{\bm{\theta}})=h({\bm{\theta}}^{\ast})=f^{\ast}.

(S12)

As a result, $\hat{\bm{\theta}}$ is the one of minimas of $h(\bm{\theta})$ . Conversely, using Eq. (S7), it is obviously to see that if $\hat{\bm{\theta}}$ is one of minimas of $h(\bm{\theta})$ , it is also one of minimas of $u(\tau,{\bm{\theta}})$ . ∎

1.2 Proof of Thm. 2

Recall Eq. (S3) and use the definition of $h(\bm{\theta})$ (see Sec. 2 of the main paper), we have

\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{z})}[\mathbb{E}_{p(% \mathbf{s|\bm{\theta}+\sqrt{2\tau}\mathbf{z}})}[f(\mathbf{s})]].

(S13)

To construct a low-variance estimation, instead of directly using the Monte Carlo gradient estimation by sampling from $p(\mathbf{z})$ and $p(\mathbf{s|\bm{\theta}+\sqrt{2\tau}\mathbf{z}})$ for gradient estimation, we manage to integrate out the stochasticity respect to $\mathbf{z}$ . Use the reparameterization Eq. (S2) and Eq. (S3), we have

\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(% \mathbf{z})}[f(\mathrm{sgn}(\bm{\theta}+\sqrt{2\tau}\mathbf{z}-\mathbf{x}))]].

(S14)

Now we can calculate the inner term $\mathbb{E}_{p(\mathbf{z})}[f(\mathrm{sgn}(\bm{\theta}+\sqrt{2\tau}\mathbf{z}-% \mathbf{x}))]$ . Due to the assumption, the target function $f(\mathbf{s})$ can be written as a $K$ -order multilinear polynomial of $\mathbf{s}$

\displaystyle\begin{aligned} &f(\mathbf{s})=a_{0}+\sum_{i_{1}}a_{1,i_{1}}s_{i_% {1}}+\sum_{i_{1}<i_{2}}a_{2,i_{1}i_{2}}s_{i_{1}}s_{1_{2}}+\sum_{i_{1}<i_{2}<i_% {3}}a_{3,i_{1}i_{2}i_{3}}s_{i_{1}}s_{1_{2}}s_{i_{3}}+\cdots\\ &+\sum_{i_{1}<\cdots<i_{K}}a_{K,i_{1}\ldots i_{K}}s_{i_{1}}\cdots s_{i_{K}}.% \end{aligned}

(S15)

Integrating respect to each dimension of the Gaussian integral, we have [50]

\displaystyle\begin{aligned} &\mathbb{E}_{p(\mathbf{z})}[f(\mathrm{sgn}(\bm{% \theta}+\sqrt{2\tau}\mathbf{z}-\mathbf{x}))]\\ &=a_{0}+\sum_{i_{1}}a_{1,i_{1}}\tilde{s}_{i_{1}}+\sum_{i_{1}<i_{2}}a_{2,i_{1}i% _{2}}\tilde{s}_{i_{1}}\tilde{s}_{1_{2}}+\sum_{i_{1}<i_{2}<i_{3}}a_{3,i_{1}i_{2% }i_{3}}\tilde{s}_{i_{1}}\tilde{s}_{1_{2}}\tilde{s}_{i_{3}}\\ &+\cdots+\sum_{i_{1}<\cdots<i_{K}}a_{K,i_{1}\ldots i_{K}}\tilde{s}_{i_{1}}% \cdots\tilde{s}_{i_{K}},\end{aligned}

(S16)

where

\displaystyle\tilde{s}_{i}=\mathbb{E}_{p(z_{i})}[\mathrm{sgn}(\theta_{i}+\sqrt% {2\tau}z_{i}-x_{i})]=\mathrm{erf}(\frac{\theta_{i}-x_{i}}{\sqrt{\tau}}),

(S17)

where $\mathrm{erf}(\cdot)$ is the error function. Therefore, we have

\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[f(\mathrm{erf}(% \frac{\bm{\theta}-\mathbf{x}}{\sqrt{\tau}}))],

(S18)

where $\mathrm{erf}(\cdot)$ is the element-wise error function.

1.3 Proof of Thm. 3

Proof.

Define the square loss of $\bm{\theta}$ as

\displaystyle e(\bm{\theta})=(h(\bm{\theta})-h(\bm{\theta}^{\ast}))^{2}.

(S19)

According to the definition of $h(\bm{\theta})$ , we have

\displaystyle e(\bm{\theta})=\mathrm{E}_{p(\mathbf{x})}[f(\mathrm{sgn(\bm{% \theta}-\mathbf{x})})-f(\mathrm{sgn(\bm{\theta}^{\ast}-\mathbf{x})})]^{2}\leq% \mathrm{E}_{p(\mathbf{x})}[(f(\mathrm{sgn(\bm{\theta}-\mathbf{x})})-f(\mathrm{% sgn(\bm{\theta}^{\ast}-\mathbf{x})}))^{2}].

(S20)

Define the error function

\displaystyle r(\tau,\mathbf{x};\bm{\theta})=u(\tau,\mathbf{x};\bm{\theta})-u(% \tau,\mathbf{x};\bm{\theta}^{\ast}).

(S21)

Then the error function satisfies the following heat equation

\displaystyle\left\{\begin{matrix}\partial_{\tau}r(\tau,\mathbf{x};\bm{\theta}% )&=&\bigtriangledown_{\mathbf{x}}r(\tau,\mathbf{x};\bm{\theta})\\ r(0,\mathbf{x};\bm{\theta})&=&f(\mathrm{sgn}(\bm{\theta}-\mathbf{x}))-f(% \mathrm{sgn}(\bm{\theta}^{\ast}-\mathbf{x}))\end{matrix}\right..

(S22)

Define the energy function of the error function $r(\tau,\mathbf{x};\bm{\theta})$ as

\displaystyle E(\tau;\bm{\theta})=\int_{\mathbb{R}^{n}}r^{2}(\tau,\mathbf{x};% \bm{\theta})p(\mathbf{x})d\mathbf{x}.

(S23)

Then applying the heat equation and the integration by parts, we have

\displaystyle\frac{d}{d\tau}E(\tau;\bm{\theta})=-2\int_{\mathbb{R}^{n}}\left\|% \bigtriangledown r(\tau,\mathbf{x};\bm{\theta})\right\|^{2}p(\mathbf{x})d% \mathbf{x}.

(S24)

Hence we have for $0<\tau_{1}<\tau_{2}$

\displaystyle E(\tau_{1};\bm{\theta})=E(\tau_{2};\bm{\theta})+2\int_{\tau_{1}}% ^{\tau_{2}}\int_{\mathbb{R}^{n}}\left\|\bigtriangledown r(\tau,\mathbf{x};\bm{% \theta})\right\|^{2}p(\mathbf{x})d\mathbf{x}d\tau.

(S25)

Use the Harnack’s inequality [22], we have

\displaystyle\left\|\bigtriangledown r(\tau,\mathbf{x};\bm{\theta})\right\|^{2% }\leq r(\tau,\mathbf{x};\bm{\theta})\partial_{\tau}r(\tau,\mathbf{x};\bm{% \theta})+\frac{n}{2\tau}r^{2}(\tau,\mathbf{x};\bm{\theta}),

(S26)

combine with Eq. (S25), we have

\displaystyle E(\tau_{1};\bm{\theta})\leq E(\tau_{2};\bm{\theta})+\frac{n}{2}% \int_{\tau_{1}}^{\tau_{2}}\frac{E(\tau;\bm{\theta})}{\tau}d\tau.

(S27)

Using the Minkowski inequality on the measure $p(\mathbf{x})$ , we have

\displaystyle\begin{aligned} h(\bm{\theta})-h(\bm{\theta}^{\ast})=e^{1/2}(\bm{% \theta})&\leq\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}-\mathbf{x% }))-u(\tau_{1};\mathbf{x};\bm{\theta}))^{2}p(\mathbf{x})d\mathbf{x}\big{)}^{1/% 2}\\ &+\big{(}\int_{\mathbb{R}^{n}}(u(\tau_{1};\mathbf{x};\bm{\theta})-u(\tau_{1};% \mathbf{x};\bm{\theta}^{\ast}))^{2}p(\mathbf{x})d\mathbf{x}\big{)}^{1/2}\\ &+\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}^{\ast}-\mathbf{x}))-% u(\tau_{1};\mathbf{x};\bm{\theta}^{\ast}))^{2}d\mathbf{x}\big{)}^{1/2}\\ &=\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}-\mathbf{x}))-u(\tau_% {1};\mathbf{x};\bm{\theta}))^{2}p(\mathbf{x})d\mathbf{x}\big{)}^{1/2}\\ &+\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}^{\ast}-\mathbf{x}))-% u(\tau_{1};\mathbf{x};\bm{\theta}^{\ast}))^{2}d\mathbf{x}\big{)}^{1/2}+E^{1/2}% (\tau_{1};\bm{\theta}).\end{aligned}

(S28)

Recall the continuity of the heat equation:

\displaystyle\lim_{\tau\rightarrow 0}\int_{\mathbb{R}^{n}}(u(\tau,\mathbf{x};% \bm{\theta})-f(\mathrm{sgn}(\bm{\theta}-\mathbf{x})))^{2}p(\mathbf{x})d\mathbf% {x}=0.

(S29)

Therefore, given $\epsilon>0$ , there exists a $\tau_{1}>0$ , such that

\displaystyle\begin{aligned} &\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{% \theta}-\mathbf{x}))-u(\tau_{1};\mathbf{x};\bm{\theta}))^{2}p(\mathbf{x})d% \mathbf{x}\big{)}^{1/2}\\ &+\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}^{\ast}-\mathbf{x}))-% u(\tau_{1};\mathbf{x};\bm{\theta}^{\ast}))^{2}p(\mathbf{x})d\mathbf{x}\big{)}^% {1/2}<\epsilon.\end{aligned}

(S30)

Recall Eq. (S27), we then have the error control for $e(\bm{\theta})$ :

\displaystyle e^{1/2}(\bm{\theta})\leq E^{1/2}(\tau_{1};\bm{\theta})+\epsilon% \leq\big{(}E(\tau_{2};\bm{\theta})+\frac{n}{2}\int_{\tau_{1}}^{\tau_{2}}\frac{% E(\tau;\bm{\theta})}{\tau}d\tau\big{)}^{1/2}+\epsilon.

(S31)

Noticed that

\displaystyle E(\tau;\bm{\theta})\leq(\breve{f}-f^{\ast})\mathbb{E}_{p(\mathbf% {x})}[u(\tau,\mathbf{x};\bm{\theta}^{\ast})-u(\tau,\mathbf{x};\bm{\theta})]=(% \breve{f}-f^{\ast})(u(\tau,\bm{\theta}^{\ast})-u(\tau,\bm{\theta})),

(S32)

where $\breve{f}=\max_{\mathbf{s}}f(\mathbf{s})$ , and we prove the theorem. ∎

2 Complexity analysis

We empirically estimate the time cost per iteration of our HeO for problems of different dimension $n$ in Fig. S1.

3 Parameter sensitivity analysis

We report the performance of our HeO on a wide range of parameter settings in Fig. S2.

4 Relation to evolution strategy

We show that the update in Eq. (8) is equivalent to a variation of the evolution strategy (ES) which is a robust and powerful algorithm for solving black-box optimization problem [51]. In fact, we have

\displaystyle\begin{aligned} \bigtriangledown_{\bm{\theta}}u(\tau,\bm{\theta})% &=\bigtriangledown_{\bm{\theta}}\mathbb{E}_{p(\mathbf{z})}[h(\bm{\theta}+\sqrt% {2\tau}\mathbf{z})]\\ &=\bigtriangledown_{\bm{\theta}}\int\frac{1}{\sqrt{4\pi\tau}^{n}}\exp(-\frac{1% }{4\tau}\left\|\mathbf{z}-\bm{\theta}\right\|^{2})h(\mathbf{z})d\mathbf{z}\\ &=\frac{1}{2\tau}\int\frac{1}{\sqrt{4\pi\tau}^{n}}\exp(-\frac{1}{4\tau}\left\|% \mathbf{z}-\bm{\theta}\right\|^{2})h(\mathbf{z})(\mathbf{z}-\bm{\theta})d% \mathbf{z}\\ &=\frac{1}{\sqrt{2\tau}}\int\frac{1}{\sqrt{2\pi}^{n}}\exp(-\frac{1}{2}\left\|% \mathbf{z}\right\|^{2})h(\bm{\theta}+\sqrt{2\tau}\mathbf{z})\mathbf{z}d\mathbf% {z}\\ &=\frac{1}{\sqrt{2\tau}}\mathbb{E}_{p(\mathbf{z})}[h(\bm{\theta}+\sqrt{2\tau}% \mathbf{z})\mathbf{z}].\end{aligned}

(S33)

The random vector $\mathbf{z}$ corresponds to the stochastic mutation and $h(\bm{\theta}+\sqrt{2\tau}\mathbf{z})$ corresponds to the fitness in Alg. 1 in [63]. In ES, the standard deviation $\sqrt{2\tau}$ is fixed in general, while in our HeO $\sqrt{2\tau_{t}}$ is varying across time. As shown in Thm. 3, the varying $\tau_{t}$ in our HeO offers a theoretical benefit that it controls the upper bound of the optimization result. In contrast, a constant $\tau$ in ES does not provide this benefit. Another difference between our HeO and ES is that we integral out $\mathbf{z}$ and using Eq. (11) to estimate the gradient, while in ES the gradient is estimated based on sampling $\mathbf{z}$ .

5 Relation to denoising diffusion models

Our approach, while bearing similarities to the DDM—a highly regarded and extensively utilized artificial generative model [24] that relies on the reverse diffusion process for data generation—differs in key aspects. The DDM necessitates reversing the diffusion process to generate data that from the target distribution. In contrast, it is unnecessary for our HeO to strictly adhering to the reverse time sequence $\tau_{t}$ in the optimization process, as under different $\tau$ , the function $u(\tau,\bm{\theta})$ shares the same optima with that of the original problem $h(\bm{\theta})$ . This claim is corroborated in Fig. S3. as shown below, where HeO applying non-monotonic schedules of $\tau_{t}$ still demonstrates superior performance. Hence, it is possible to explore diverse $\tau_{t}$ schedules to further performance enhancement.

6 Cooperative optimization

Our study suggests that HeO exhibits a distinct cooperative optimization mechanism, setting it apart from current methodologies. Specifically, HeO benefits from the fact that target functions $u(\tau,\bm{\theta})$ share the same optima as the original problem $h(\bm{\theta})$ for any $\tau>0$ . This characteristic allows the solver to transition between different $\tau$ values during the optimization process, eliminating the necessity for a monotonic $\tau_{t}$ schedule. In contrast, traditional methods such as those based on quantum adiabatic theory or bifurcation theory require a linear increase of a control parameter $a_{t}$ from 0 to 1. This parameter is analogous to $\tau_{t}$ in the HeO framework.

To empirically verify the above claim, we introduce a random perturbation to the $\tau$ schedule in Alg. 1, rendering it non-monotonic: $\tilde{\tau}_{t}=c^{2}_{t}\tau_{t}$ , where $c_{t}$ is uniformly distributed on $[1-\delta,1+\delta]$ with $\delta$ controlling the amplitude of the perturbation. For other methods based on quantum adiabatic theory or bifurcation theory, we correspondingly introduce the perturbation as $\tilde{a}_{t}=\mathrm{clamp}(c_{t}a_{t},0,1)$ . If an algorithm contains cooperative optimization mechanism, it still works well even when the control parameter is not monotonic, as optimizing the transformed problems under different control parameters cooperatively contributes to optimizing the original problem. As shown in Fig. S3, the performance of other methods are all dramatically deteriorated. In contrast, HeO shows no substantial decline in performance, corroborating that HeO employs a cooperative optimization mechanism.

7 Implementation details

All the experiments are conducted on a single NVIDIA RTX-3090 GPU (24GB) and an Intel(R) Xeon(R) Gold 6226R CPU (2.90GHz). For convenience, we denote $\sqrt{\tau}_{t}=\sigma_{t}$ .

Monte Carlo gradient estimation for combinatorial optimization. Based on Eq. (5), we construct a combinatorial optimization algorithm using the Monte Carlo gradient estimation (MCGE), as shown in Alg. 2. Noticed that we clamp the parameter $\bm{\theta}_{t}$ in the $[0,1]$ for numerical stability; additionally, we binarize the $\bm{\theta}_{T}$ to obtain the optimized binary configuration $\mathbf{s}_{T}$ in the end.

Algorithm 2 Monte Carlo gradient estimation for combinatorial optimization (MCGE)

Input: target function

f(\cdot)

, step size

\gamma

, sample number

M

, iteration number

T

initialize elements of

\bm{\theta}_{0}

0.5

for

t=0

T-1

sample

\mathbf{s}^{(m)}\sim_{i.i.d.}p(\mathbf{s}|\bm{\theta}_{t}),\quad m=1,\ldots,M

\mathbf{g}_{t}\leftarrow\frac{1}{M}\sum_{m=1}^{M}f(\mathbf{s}^{(m)})% \bigtriangledown_{\bm{\theta}_{t}}\log p(\mathbf{s}^{(m)}|\bm{\theta}_{t})

\bm{\theta}_{t+1}\leftarrow\mathrm{clamp}\big{(}\bm{\theta}_{t}-\gamma\mathbf{% g}_{t},0,1\big{)}

end for

\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)

Output: binary configuration

\mathbf{s}_{T}

Gradient descent with momentum. We provide HeO with momentum in Alg. 3.

Algorithm 3 Heat diffusion optimization (HeO) with momentum

Input: target function

f(\cdot)

, step size

\gamma

, momentum

\kappa

\sigma

schedule

\{\sigma_{t}\}

, iteration number

T

, set

\mathbf{g}_{-1}=\mathbf{0}

initialize elements of

\bm{\theta}_{0}

0.5

for

t=0

T-1

sample

\mathbf{x}_{t}\sim\mathrm{Unif}[0,1]^{n}

\mathbf{w}_{t}\leftarrow\bigtriangledown_{\bm{\theta}_{t}}f(\mathrm{erf}(\frac% {\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}))

\mathbf{g}_{t}\leftarrow\kappa\mathbf{g}_{t-1}+\gamma\mathbf{w}_{t}

\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}+% \mathbf{g}_{t}\big{)}

end for

\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)

Output: binary configuration

\mathbf{s}_{T}

Toy example. We set the momentum $\kappa=0.9999$ , learning rate $\gamma=2$ and iteration number $T=5000$ for HeO (Alg. 3). For MCGE without momentum, we set $T=50000$ , we set $\gamma=$ 1e-6, momentum $\kappa=0$ , and $M=10$ . For MCGE with momentum, we set momentum as $0.9$ .

Max-cut problem. For solving the max-cut problems from the Biq Mac Library [36], we set the steps $T=5000$ for all the algorithms. For HeO, we set $\gamma=2$ and $\sigma_{t}$ linearly decreases from $1$ to $0$ for HeO, and we set momentum as zero. For LQA and SIM-CIM, we use the setting in [10]. For bSB, dSB, aSB, and CIM, we apply the settings in [64]. To reduce the fluctuations of the results, for each algorithm $\mathrm{alg}$ and instant $i$ , the relative loss is calculated as $|C^{i,\mathrm{alg}}-C^{i}_{\min}|/|C_{\min}^{i}|$ , where $C^{i,\mathrm{alg}}$ is the lowest output of the algorithm $\mathrm{alg}$ on the instance $i$ over 10 tries, and $C^{i}_{\min}$ is the lowest output of all 7 the algorithm on the instance $i$ . For each test, we estimate the mean and std from 10 runs.

3-SAT problem. For Boolean 3-satisfiability (3-SAT) problem, we set the momentum $\kappa=0.9999$ , $T=5000$ , $\gamma=2$ , and $\sigma_{t}$ linearly decreases from $\sqrt{2}$ to $0$ for HeO. According to the empirical finding that high-order loss function leads to better results [16], we include higher order terms in the target function. However, since

\displaystyle(\frac{1-c_{i}s_{i}}{2})^{k}=\frac{1-c_{i}s_{i}}{2}

(S34)

for any $s_{i},c_{i}\in\{-1,1\}$ , directly introducing higher order term in the $f(\mathbf{s})$ is useless. Instead, we adjust the gradient from

\displaystyle\bigtriangledown_{\bm{\theta}_{t}}f(\mathrm{erf}(\frac{\bm{\theta% }_{t}-\mathbf{x}_{t}}{\sigma_{t}})))=\bigtriangledown_{\bm{\theta}_{t}}[\sum_{% h=1}^{H}\prod_{i=1}^{3}\frac{1}{2}(1-c_{h_{i}}\big{(}\mathrm{erf}(\frac{\bm{% \theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}})\big{)}_{h_{i}})],

(S35)

\displaystyle\bigtriangledown_{\bm{\theta}_{t}}[\sum_{h=1}^{H}\prod_{i=1}^{3}(% \frac{1}{2}(1-c_{h_{i}}\big{(}\mathrm{erf}(\frac{\bm{\theta}_{t}-\mathbf{x}_{t% }}{\sigma_{t}})\big{)}_{h_{i}}))^{4}].

(S36)

We consider the 3-SAT problems with various number of variables from the SATLIB [37]. For each number of variables in the dataset, we consider the first 100 instance. We apply the same configuration of that in [16] for both 3-order solver and 2-order oscillation Ising machine solver. The energy gap of the 2-order solver is set as 1. For each test, we estimate the mean and std from 100 runs.

Algorithm 4 Heat diffusion optimization (HeO) for 3-SAT problem

Input: adjusted target function

f(\mathbf{s})=\sum_{h=1}^{H}\prod_{i=1}^{3}\big{(}\frac{1-c_{h_{i}}s_{h_{i}}}{% 2}\big{)}^{4}

(Eq. (S36)), step size

\gamma

, momentum

\kappa

\sigma

schedule

\{\sigma_{t}\}

, iteration number

T

initialize elements of

\bm{\theta}_{0}

0.5

, set

\mathbf{g}_{-1}=\mathbf{0}

for

t=0

T-1

sample

\mathbf{x}_{t}\sim\mathrm{Unif}[0,1]^{n}

\mathbf{w}_{t}\leftarrow\bigtriangledown_{\bm{\theta}_{t}}f(\mathrm{erf}(\frac% {\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}))

\mathbf{g}_{t}\leftarrow\kappa\mathbf{g}_{t-1}+\gamma\mathbf{w}_{t}

\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \gamma\mathbf{g}_{t}\big{)}

end for

\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)

Output: binary configuration

\mathbf{s}_{T}

Ternary-value neural network learning. We represent a ternary variable $s_{\mathrm{t}}\in\{-1,0,1\}$ as $s_{\mathrm{t}}=\frac{1}{2}(s_{\mathrm{b},1}+s_{\mathrm{b},2})$ with two bits $s_{\mathrm{b},1},s_{\mathrm{b},2}\in\{-1,1\}$ . In this way, each element of $W$ can be represented as a function of $\mathbf{s}\in\mathbb{R}^{m\times n\times 2}$ . We denote this relation as a matrix-value function $W=W(\mathbf{s})$

\displaystyle W_{ij}(\mathbf{s})=\frac{1}{2}(s_{ij,1}+s_{ij,2}),\quad i=1,% \ldots,m,\quad,j=1,\ldots,n,

(S37)

Based on the above encoding procedure, we design the training algorithm for based on HeO in Alg. 5. The input $\mathbf{v}$ of the dataset $\mathcal{D}$ is generated from the uniform distribution on $\{-1,0,1\}^{n}$ . For HeO, we set $T=10000$ , $\gamma=0.5$ , $\kappa=0.999$ , and $\sigma_{t}$ linearly decreasing from $\sqrt{2}$ to $0$ . For MCGE, we set $T=10000$ , $\gamma=1e-7$ , $M=10$ , and $\kappa=0.9999$ . We empirically find that MCGE need high sampling number ( $M$ ) and low learning rate ( $\gamma$ ) for stability, while this is not the case for HeO. For each test, we estimate the mean and std from 10 runs.

Algorithm 5 Heat diffusion optimization (HeO) for training ternary-value neural network

Input: dataset

\mathcal{D}

, step size

\gamma

, momentum

\kappa

\sigma

schedule

\{\sigma_{t}\}

, iteration number

T

initialize elements of

\bm{\theta}_{0}

1

, initialize elements of

\tilde{\bm{\beta}}_{0}

0

, set

\mathbf{g}_{-1}=\mathbf{0}

for

t=0

T-1

sample

\mathbf{x}_{t}\sim\mathrm{Unif}[0,1]^{n}

W_{t}\leftarrow W(\mathrm{erf}(\frac{\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t% }}))

(Eq. (S37))

\mathrm{MSE}\leftarrow\frac{1}{\left|\mathcal{D}\right|}\sum_{(\mathbf{v},% \mathbf{y})\in\mathcal{D}}\left\|\bm{\Gamma}(\mathbf{v};W_{t})-\mathbf{y}% \right\|^{2}

\mathbf{w}_{t}\leftarrow\bigtriangledown_{\bm{\theta}_{t}}\mathrm{MSE}

\mathbf{g}_{t}\leftarrow\kappa\mathbf{g}_{t-1}+\gamma\mathbf{w}_{t}

\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \mathbf{g}_{t}\big{)}

end for

\mathbf{s}_{T}=\mathrm{sgn}({\theta}_{T}-0.5)

Output:

W_{T}=W(\mathbf{s}_{T})

Variable selection problem. We construct an algorithm for variable selection problem based on HeO as shown in Alg. 6, where the function $f(\mathbf{s},\bm{\beta})$ is defined in Eq. (17).

Algorithm 6 Heat diffusion optimization (HeO) for linear regression variable selection

Input: dataset

\mathcal{D}

, step size

\gamma

, momentum

\kappa

\sigma

schedule

\{\sigma_{t}\}

, iteration number

T

initialize elements of

\bm{\theta}_{0}

1

, initialize elements of

\tilde{\bm{\beta}}_{0}

0

, set

\mathbf{g}^{\beta}_{-1}=\mathbf{g}^{\theta}_{-1}=\mathbf{0}

for

t=0

T-1

sample

\mathbf{x}^{\theta}_{t}\sim\mathrm{Unif}[0,1]^{n}

\mathbf{w}^{\beta}_{t}\leftarrow\bigtriangledown_{\tilde{\bm{\beta}}_{t}}f(% \mathrm{erf}(\frac{\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}),\bm{\beta})

\mathbf{g}^{\beta}_{t}\leftarrow\kappa\mathbf{g}^{\beta}_{t-1}+\frac{\gamma}{T% }\mathbf{w}^{\beta}_{t}

\tilde{\bm{\beta}}_{t+1}\leftarrow\tilde{\bm{\beta}}_{t}-\mathbf{g}^{\beta}_{t}

\mathbf{w}^{\theta}_{t}\leftarrow\bigtriangledown_{\bm{\theta}_{t}}f(\mathrm{% erf}(\frac{\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}),\bm{\beta})

\mathbf{g}^{\theta}_{t}\leftarrow\kappa\mathbf{g}^{\theta}_{t-1}+\gamma\mathbf% {w}^{\theta}_{t}

\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \gamma\mathbf{g}^{\theta}_{t}\big{)}

end for

\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)

Output:

\mathbf{s}_{T}

We randomly generate $400$ -dimensional datasets with $1000$ training samples. The input $\mathbf{v}$ is sampled from a standard Gaussian distribution. The element of the ground-truth coefficient $\bm{\beta}^{\ast}$ is uniformly distributed on $[-2,-1]\cup[1,2]$ , and each element haves $1-q$ probability of being set as zero and thus should be ignored for the prediction. We apply a five-fold cross-validation for all of methods. For our HeO, we set $T=2000$ and $\gamma=1$ , $\kappa=0.999$ . We generate an ensemble of indicators $\mathbf{s}$ of size 100. For each $\mathbf{s}$ in the ensemble, we fit a linear model by implementing an OLS on the non-zero variables indicated by $\mathbf{s}$ and calculate the average MSE loss of the linear model on the cross-validation sets. We then select the linear model with lowest MSE on the validate sets as the output linear model. For Lasso and $L_{0.5}$ regression, we follow the implementation in [41] with 10 iterations. the regularization parameter is selected by cross-validation from $\{0.05,0.1,0.2,0.5,1,2,5\}$ . For each test, we estimate the mean and std from 10 runs.

Table S2: The attributes of the real world graphs and the parameter settings of HeO.

graph name	$\|V\|$	$\|E\|$	$T$	$\gamma$	$\lambda_{t}$	$\sigma_{t}$
tech-RL-caida	190914	607610	200	2.5	linearly from 0 to 2.5	linearly from $\sqrt{2}$ to 0
soc-youtube	495957	1936748	200	2.5	linearly from 0 to 2.5
inf-roadNet-PA	1087562	1541514	200	2.5	linearly from 0 to 7.5
inf-roadNet-CA	1957027	2760388	200	5	linearly from 0 to 7.5
socfb-B-anon	2937612	20959854	50	2.5	linearly from 0 to 5
socfb-A-anon	3097165	23667394	50	2.5	linearly from 0 to 5
socfb-uci-uni	58790782	92208195	50	2.5	linearly from 0 to 5

Minimum vertex cover problem. For constrained binary optimization

	$\displaystyle\min_{\mathbf{s}\in\{\-1,1\}^{n}}f(\mathbf{s}),$		(S38)
	$\displaystyle g_{k}(\mathbf{s})\leq 0,\quad,k=1,\ldots,K,$		(S39)

we put the constrains as the penalty function with coefficient $\lambda$ into the target function

\displaystyle f_{\lambda}(\mathbf{s})=f(\mathbf{s})-\lambda\sum_{k=1}^{K}g_{k}% (\mathbf{s}),

(S40)

and the corresponding algorithm is shown in Alg. 7.

Algorithm 7 Heat diffusion optimization (HeO) for constrained binary optimization

Input: target function with penalty

f_{\lambda}

, step size

\gamma

\sigma

schedule

\{\sigma_{t}\}

, penalty coefficients schedule

\{{\lambda}_{t}\}

, iteration number

T

initialize elements of

\bm{\theta}_{0}

0.5

, set

\mathbf{g}_{-1}=\mathbf{0}

for

t=0

T-1

sample

\mathbf{x}_{t}

from

\mathrm{Unif}[0,1]^{n}

\mathbf{w}_{t}\leftarrow\bigtriangledown_{{\bm{\theta}}_{t}}f_{\lambda_{t}}(% \frac{\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}})

\mathbf{g}_{t}\leftarrow\kappa\mathbf{g}_{t-1}+\gamma\mathbf{w}_{t}

\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \gamma\mathbf{g}_{t}\big{)}

end for

\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)

Output:

\mathbf{s}_{T}

Algorithm 8 Refinement of the result of MVC

Input: the result of HeO

\mathbf{s}_{T}

for

i=1

n

set

s_{T,i}

0

\mathbf{s}_{T}

is still a vertex cover

end for

Output:

\mathbf{s}_{T}

We implement the HeO on a single NVIDIA RTX 3090 GPU for all the minimum vertex cover (MVC) experiments. Let $\mathbf{s}$ be the configuration to be optimized, in which $s_{i}$ is $1$ if we select $i$ -th vertex into $\mathcal{V}_{c}$ , otherwise we do not select $i$ -vertex into $\mathcal{V}_{c}$ . The target function to be minimize is the size of $\mathcal{V}_{c}$ : $f(\mathbf{s})=\sum_{i=1}^{n}\frac{s_{i}+1}{2}$ , and the constrains are

\displaystyle g_{ij}(\mathbf{s})=(1-\frac{s_{i}+1}{2})(1-\frac{s_{j}+1}{2})=0,% \quad\forall i,j,e_{ij}\in\mathcal{E},

(S41)

where $e_{ij}$ represent the edge connecting the $i$ and $j$ -th vertices. We construct the target function $f_{\lambda}(\mathbf{s})=f(\mathbf{s})+\lambda\sum_{e_{ij}\in\mathcal{E}}g_{ij}% (\mathbf{s})$ . The term with the positive factor $\lambda$ penalizes vector $\mathbf{s}$ when there are uncovered edges. After the HeO outputs the result $\mathbf{s}_{T}$ , we empirically find that its subset may also form a vertex cover for the graph $\mathcal{G}$ , so we implement the following refinement on the result $\mathbf{s}_{T}$ , as shown in Alg. 8. We report the vertex number, edge number and settings of HeO in Tab. S2. For FastVC, we follow the settings in [45] and use its codebase, and set the cut-off time as the same as the time cost of HeO. For each test, we estimate the mean and std from 10 runs.