Efficient Combinatorial Optimization via Heat Diffusion
Hengyuan Ma1, Wenlian Lu1,2,3,4,5,6, Jianfeng Feng1,2,3,4,7\ast


1 Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai 200433, China
2 Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
3 School of Mathematical Sciences, Fudan University, No. 220 Handan Road, Shanghai, 200433, Shanghai, China
4 Shanghai Center for Mathematical Sciences, No. 220 Handan Road, Shanghai, 200433, Shanghai, China
5 Shanghai Key Laboratory for Contemporary Applied Mathematics, No. 220 Handan Road, Shanghai, 200433, Shanghai, China
6 Key Laboratory of Mathematics for Nonlinear Science, No. 220 Handan Road, Shanghai, 200433, Shanghai, China
7 Department of Computer Science, University of Warwick, Coventry, CV4 7AL, UK
\ast [email protected]

Abstract

Combinatorial optimization problems are widespread but inherently challenging due to their discrete nature. The primary limitation of existing methods is that they can only access a small fraction of the solution space at each iteration, resulting in limited efficiency for searching the global optimal. To overcome this challenge, diverging from conventional efforts of expanding the solver’s search scope, we focus on enabling information to actively propagate to the solver through heat diffusion. By transforming the target function while preserving its optima, heat diffusion facilitates information flow from distant regions to the solver, providing more efficient navigation. Utilizing heat diffusion, we propose a framework for solving general combinatorial optimization problems. The proposed methodology demonstrates superior performance across a range of the most challenging and widely encountered combinatorial optimizations. Echoing recent advancements in harnessing thermodynamics for generative artificial intelligence, our study further reveals its significant potential in advancing combinatorial optimization. The codebase of our study is available in https://github.com/AwakerMhy/HeO.

1 Introduction

Combinatorial optimization problems are prevalent in various applications, encompassing circuit design [1], machine learning [2], computer vision [3], molecular dynamics simulation [4], traffic flow optimization [5], and financial risk analysis [6]. This widespread application creates a significant demand for accelerated solutions to these problems. Alongside classical algorithms, which encompass both exact solvers and metaheuristics [7], recent years have seen remarkable advancements in addressing combinatorial optimization. These include quantum adiabatic approaches [8, 9, 10], simulated bifurcation [11, 12, 13], coherent Ising machine [14, 15], high-order Ising machine [16], and deep learning techniques [17, 18]. However, due to the exponential growth of the solution number, finding the optima within a limited computational budget remains a daunting challenge.

Our primary focus is on iterative approximation solvers, which constitute a significant class of combinatorial optimization methods. An iterative approximation solvers typically begin with an initial solution and iteratively improve it by finding better solutions within the neighborhood of the current solution, known as the search scope or more vividly, receptive field. However, due to combinatorial explosion, as the scope of the receptive field increases, the number of solutions to be assessed grows exponentially, making a thorough evaluation of all these solutions expensive. As a result, current approaches are limited to a narrow receptive field, rendering them blind to distant regions in the solution space and heightening the risk of getting trapped in local minimas or areas with bumpy landscapes. Although methods like large neighborhood search [19], variable neighborhood search [20] and path auxiliary sampling [21] are designed to broaden the search scope, they can only gather a modest increment of information from the expanded search scope. Consequently, the current solvers’ receptive field remains significantly constrained, impeding their efficiency.

Refer to caption
Figure 1: The Heat diffusion optimization (HeO) framework. The efficiency of searching a key in a dark room is significantly improved by employing navigation that utilizes heat emission from the key. In combinatorial optimization, heat diffusion transforms the target function into different versions while preserving the location of the optima. Therefore, the gradient information of these transformed functions cooperatively help to optimize the original target function.

In this study, we approach the prevalent limitation stated above from a unique perspective. Instead of expanding the solver’s receptive field to acquire more information from the solution space, we concentrate on propagating information from distant areas of the solution space to the solver via heat diffusion [22]. To illustrate, imagine a solver searching for the optima in the solution space akin to a person searching for a key in a dark room, as depicted in Fig. 1. Without light, the person is compelled to rely solely on touching his surrounding space. The tactile provides only localized information, leading to inefficient navigation. This mirrors the current situation in combinatorial optimization, wherein the receptive field is predominantly confined to local information. However, if the key were to emit heat, its radiating warmth would be perceptible from a distance, acting as a directional beacon. This would significantly enhance navigational efficiency for finding the key.

Motivated by the above metaphor, we propose a simple but efficient framework utilizing heat diffusion to solve various combinatorial optimization problems. Heat diffusion transforms the target function into different versions, within which the information from distant regions actively flow toward the solver. Crucially, the backward uniqueness of the heat equation [23] guarantees that the original problem’s optima are unchanged under these transformations. Therefore, information of target functions under different heat diffusion transformations can be cooperatively employed for optimize the original problem (Fig. 1). Empirically, our framework demonstrates superior performance compared to advanced algorithms across a diverse range of combinatorial optimization instances, spanning quadratic to polynomial, binary to ternary, unconstrained to constrained, and discrete to mixed-variable scenarios. Mirroring the recent breakthroughs in generative artificial intelligence through diffusion processes [24], our research further reveals the potential of heat diffusion, a related thermodynamic phenomenon, in enhancing combinatorial optimization.

2 Failure of gradient-based combinatorial optimization

We will first formulate the general combinatorial optimization problem and then reframe it in a gradient descent-based manner. This reformulation allows us to utilize heat diffusion later. Various combinatorial optimization problems can be naturally formalized as a pseudo-Boolean optimization (PBO) problem [25], in which we aim to find the minima of a real-value target function fn𝑓superscript𝑛maps-tof\in\mathbb{R}^{n}\mapsto\mathbb{R}italic_f ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ↦ blackboard_R subjecting to a binary constraints

min𝐬{1,1}nf(𝐬),subscript𝐬superscript11𝑛𝑓𝐬\displaystyle\min_{\mathbf{s}\in\{-1,1\}^{n}}f(\mathbf{s}),roman_min start_POSTSUBSCRIPT bold_s ∈ { - 1 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_f ( bold_s ) , (1)

where 𝐬𝐬\mathbf{s}bold_s is binary configuration (bits), and f()𝑓f(\cdot)italic_f ( ⋅ ) is the target function. Through the transformation (𝐬+1)/2𝐬12(\mathbf{s}+1)/2( bold_s + 1 ) / 2, our definition aligns with that in [26], where elements of 𝐬𝐬\mathbf{s}bold_s take 00 or 1111. Given the advanced development of gradient-based algorithms, we are interested in converting the discrete optimization problem into a differentiable one, thereby enabling gradient descent. To achieve this purpose, we encode the bits sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, i=1,,n𝑖1𝑛i=1,\ldots,nitalic_i = 1 , … , italic_n as independent Bernoulli variables p(si=±1|𝜽)=0.5±(θi0.5)𝑝subscript𝑠𝑖plus-or-minusconditional1𝜽plus-or-minus0.5subscript𝜃𝑖0.5p(s_{i}=\pm 1|\bm{\theta})=0.5\pm(\theta_{i}-0.5)italic_p ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ± 1 | bold_italic_θ ) = 0.5 ± ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 0.5 ) with θi[0,1]subscript𝜃𝑖01\theta_{i}\in[0,1]italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ 0 , 1 ]. In this way, we convert the original combinatorial optimization problem into

min𝜽h(𝜽),subscript𝜽𝜽\displaystyle\min_{\bm{\theta}\in\mathcal{I}}h(\bm{\theta}),roman_min start_POSTSUBSCRIPT bold_italic_θ ∈ caligraphic_I end_POSTSUBSCRIPT italic_h ( bold_italic_θ ) , (2)

where :=[0,1]nassignsuperscript01𝑛\mathcal{I}:=[0,1]^{n}caligraphic_I := [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, and h(𝜽)=𝔼p(𝐬|𝜽)[f(𝐬)]𝜽subscript𝔼𝑝conditional𝐬𝜽delimited-[]𝑓𝐬h(\bm{\theta})=\mathbb{E}_{p(\mathbf{s}|\bm{\theta})}[f(\mathbf{s})]italic_h ( bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_s | bold_italic_θ ) end_POSTSUBSCRIPT [ italic_f ( bold_s ) ]. The minima 𝜽superscript𝜽\bm{\theta}^{\ast}bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT of Eq. (2) is 𝜽=0.5(sgn(𝐬)+1)superscript𝜽0.5sgnsuperscript𝐬1\bm{\theta}^{\ast}=0.5(\mathrm{sgn}(\mathbf{s}^{\ast})+1)bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = 0.5 ( roman_sgn ( bold_s start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) + 1 ), given that 𝐬superscript𝐬\mathbf{s}^{\ast}bold_s start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is a minima of the original problem Eq. (1). Here, sgn()sgn\mathrm{sgn}(\cdot)roman_sgn ( ⋅ ) is the element-wise sign function. Now Eq. (2) can be solved through gradient descent starting from a given initial 𝜽0subscript𝜽0\bm{\theta}_{0}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT

𝜽t+1=𝜽tγ𝜽h(𝜽t),t=1,,T,formulae-sequencesubscript𝜽𝑡1subscript𝜽subscript𝜽𝑡𝛾subscript𝜽𝑡𝑡1𝑇\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\gamma\bigtriangledown_{\bm{% \theta}}h(\bm{\theta}_{t}),\quad t=1,\ldots,T,bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_h ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , italic_t = 1 , … , italic_T , (3)

where γ𝛾\gammaitalic_γ is the learning rate and T𝑇Titalic_T is the iteration number. Unfortunately, this yields a probability distribution p(𝐬|𝜽)𝑝conditional𝐬𝜽p(\mathbf{s}|\bm{\theta})italic_p ( bold_s | bold_italic_θ ) over the configuration space {1,1}nsuperscript11𝑛\{-1,1\}^{n}{ - 1 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, instead of a deterministic binary configuration 𝐬𝐬\mathbf{s}bold_s as desired. Although we can manually binarize 𝜽𝜽\bm{\theta}bold_italic_θ through 𝐁(𝜽):=sgn(𝜽0.5)assign𝐁𝜽sgn𝜽0.5\mathbf{B}(\bm{\theta}):=\mathrm{sgn}(\bm{\theta}-0.5)bold_B ( bold_italic_θ ) := roman_sgn ( bold_italic_θ - 0.5 ) to get the binary configuration which maximizes probability p(𝐬|𝜽)𝑝conditional𝐬𝜽p(\mathbf{s}|\bm{\theta})italic_p ( bold_s | bold_italic_θ ), the outcome f(𝐁(𝜽))𝑓𝐁𝜽f(\mathbf{B}(\bm{\theta}))italic_f ( bold_B ( bold_italic_θ ) ) may be much higher than h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ), resulting in significant performance degradation [27]. This suggests that a good gradient-based optimizer should efficiently diminish the uncertainty in the output distribution p(𝐬|𝜽)𝑝conditional𝐬𝜽p(\mathbf{s}|\bm{\theta})italic_p ( bold_s | bold_italic_θ ), which can be measured by its total variance

V(𝜽)=i=1nθi(1θi).𝑉𝜽superscriptsubscript𝑖1𝑛subscript𝜃𝑖1subscript𝜃𝑖\displaystyle V(\bm{\theta})=\sum_{i=1}^{n}\theta_{i}(1-\theta_{i}).italic_V ( bold_italic_θ ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( 1 - italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) . (4)

2.1 Monte Carlo gradient estimation

Conventionally, we can solve the problem Eq. (2) by approximating the gradient of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ) via Monte Carlo gradient estimation (MCGE) [28] (Alg. 2, Appendix), in which we estimate the gradient in Eq. (3) as

𝜽h(𝜽)=𝔼p(𝐬|𝜽)[f(𝐬)𝜽logp(𝐬|𝜽)]1Mm=1Mf(𝐬(m))𝜽logp(𝐬(m)|𝜽),subscript𝜽𝜽subscript𝔼𝑝conditional𝐬𝜽delimited-[]subscript𝜽𝑓𝐬𝑝conditional𝐬𝜽subscript𝜽1𝑀superscriptsubscript𝑚1𝑀𝑓superscript𝐬𝑚𝑝conditionalsuperscript𝐬𝑚𝜽\displaystyle\begin{aligned} \bigtriangledown_{\bm{\theta}}h(\bm{\theta})=% \mathbb{E}_{p(\mathbf{s}|\bm{\theta})}[f(\mathbf{s})\bigtriangledown_{\bm{% \theta}}\log p(\mathbf{s}|\bm{\theta})]\approx\frac{1}{M}\sum_{m=1}^{M}f(% \mathbf{s}^{(m)})\bigtriangledown_{\bm{\theta}}\log p(\mathbf{s}^{(m)}|\bm{% \theta}),\end{aligned}start_ROW start_CELL ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_h ( bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_s | bold_italic_θ ) end_POSTSUBSCRIPT [ italic_f ( bold_s ) ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_s | bold_italic_θ ) ] ≈ divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_f ( bold_s start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_s start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT | bold_italic_θ ) , end_CELL end_ROW (5)

where 𝐬(m)i.i.d.p(𝐬|𝜽),m=1,,Mformulae-sequencesubscriptsimilar-toformulae-sequence𝑖𝑖𝑑superscript𝐬𝑚𝑝conditional𝐬𝜽𝑚1𝑀\mathbf{s}^{(m)}\sim_{i.i.d.}p(\mathbf{s}|\bm{\theta}),\quad m=1,\ldots,Mbold_s start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∼ start_POSTSUBSCRIPT italic_i . italic_i . italic_d . end_POSTSUBSCRIPT italic_p ( bold_s | bold_italic_θ ) , italic_m = 1 , … , italic_M. However, it turns out that MCGE performs poorly even equipped with momentum, compared to existing solvers such as simulated annealing and Hopfield neural network, as shown in Fig. 2. We interpret this result as follows. Although MCGE turns the combinatorial optimization problem into a differentiable one, it does not reduce any inherent complexity of the original problem, which may contains a convoluted landscape. As gradient only provides local information, MCGE is susceptible to be trapped in local minimas.

3 Heat diffusion optimization

The inferior performance of the MCGE is attributed to the narrow receptive field of the gradient descent. To overcome this drawback, we manage to provide more efficient navigation to the solver by employing heat diffusion [29], which propagates information from distant region to the solver. Intuitively, consider the parameter space as a thermodynamic system, where each parameter 𝜽𝜽\bm{\theta}bold_italic_θ is referred to as a location and is associated with an initial temperature value h(𝜽)𝜽-h(\bm{\theta})- italic_h ( bold_italic_θ ), as shown in Fig. 1. Then the optimization procedure can be described as the process that the solver is walking around the parameter space to find the location 𝜽superscript𝜽\bm{\theta}^{\ast}bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT with the highest temperature (or equivalently, the global minima of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ )). As time progresses, heat flows obeying the Newton’s law of cooling, leading to an evolution of the temperature distribution across time. The heat at 𝜽superscript𝜽\bm{\theta}^{\ast}bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT flows towards surrounding areas, ultimately reaching the solver’s location. This provides valuable information for the solver, as it can trace the direction of heat flow to locate the 𝜽superscript𝜽\bm{\theta}^{\ast}bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT.

3.1 Heat diffusion on the parameter space

Now we introduce heat diffusion for combinatorial optimization. We extent the parameter space of 𝜽𝜽\bm{\theta}bold_italic_θ from [0,1]nsuperscript01𝑛[0,1]^{n}[ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT to ¯nsuperscript¯𝑛\bar{\mathbb{R}}^{n}over¯ start_ARG blackboard_R end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT with ¯={,+}¯\bar{\mathbb{R}}=\mathbb{R}\cup\{-\infty,+\infty\}over¯ start_ARG blackboard_R end_ARG = blackboard_R ∪ { - ∞ , + ∞ }. To keep the probabilistic distribution p(𝐬|𝜽)𝑝conditional𝐬𝜽p(\mathbf{s}|\bm{\theta})italic_p ( bold_s | bold_italic_θ ) meaningful for 𝜽[0,1]n𝜽superscript01𝑛\bm{\theta}\notin[0,1]^{n}bold_italic_θ ∉ [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, we now redefine p(si=±1|𝜽)=clamp(0.5±(θi0.5),0,1)𝑝subscript𝑠𝑖plus-or-minusconditional1𝜽clampplus-or-minus0.5subscript𝜃𝑖0.501p(s_{i}=\pm 1|\bm{\theta})=\mathrm{clamp}(0.5\pm(\theta_{i}-0.5),0,1)italic_p ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ± 1 | bold_italic_θ ) = roman_clamp ( 0.5 ± ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 0.5 ) , 0 , 1 ), where the clamp function is calculated as clamp(x,0,1)=max(0,min(x,1))clamp𝑥010𝑥1\mathrm{clamp}(x,0,1)=\max(0,\min(x,1))roman_clamp ( italic_x , 0 , 1 ) = roman_max ( 0 , roman_min ( italic_x , 1 ) ). Denote the temperature at location 𝜽𝜽\bm{\theta}bold_italic_θ and time τ𝜏\tauitalic_τ as u(τ,𝜽)𝑢𝜏𝜽u(\tau,\bm{\theta})italic_u ( italic_τ , bold_italic_θ ), which is the solution to the following unbounded heat equation [29]

{τu(τ,𝜽)=Δ𝜽u(τ,𝜽),τ>0,𝜽nu(τ,𝜽)=h(𝜽),τ=0,𝜽n,\displaystyle\left\{\begin{matrix}\partial_{\tau}u(\tau,\bm{\theta})&=&\Delta_% {\bm{\theta}}u(\tau,\bm{\theta}),&\quad\tau>0,&\quad\bm{\theta}\in\mathbb{R}^{% n}\\ u(\tau,\bm{\theta})&=&h(\bm{\theta}),&\quad\tau=0,&\quad\bm{\theta}\in\mathbb{% R}^{n}\\ \end{matrix}\right.,{ start_ARG start_ROW start_CELL ∂ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_u ( italic_τ , bold_italic_θ ) end_CELL start_CELL = end_CELL start_CELL roman_Δ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_u ( italic_τ , bold_italic_θ ) , end_CELL start_CELL italic_τ > 0 , end_CELL start_CELL bold_italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_u ( italic_τ , bold_italic_θ ) end_CELL start_CELL = end_CELL start_CELL italic_h ( bold_italic_θ ) , end_CELL start_CELL italic_τ = 0 , end_CELL start_CELL bold_italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG , (6)

where ΔΔ\Deltaroman_Δ is the Laplacian operator: Δg(𝐱)=i=1nxig(𝐱)Δ𝑔𝐱superscriptsubscript𝑖1𝑛subscriptsubscript𝑥𝑖𝑔𝐱\Delta g(\mathbf{x})=\sum_{i=1}^{n}\partial_{x_{i}}g(\mathbf{x})roman_Δ italic_g ( bold_x ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∂ start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_g ( bold_x ). For 𝜽¯n/n𝜽superscript¯𝑛superscript𝑛\bm{\theta}\in\bar{\mathbb{R}}^{n}/\mathbb{R}^{n}bold_italic_θ ∈ over¯ start_ARG blackboard_R end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT / blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, we define u(τ,𝜽)=lim𝜽n𝜽u(τ,𝜽n)𝑢𝜏𝜽subscriptsubscript𝜽𝑛𝜽𝑢𝜏subscript𝜽𝑛u(\tau,\bm{\theta})=\lim_{\bm{\theta}_{n}\rightarrow\bm{\theta}}u(\tau,\bm{% \theta}_{n})italic_u ( italic_τ , bold_italic_θ ) = roman_lim start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT → bold_italic_θ end_POSTSUBSCRIPT italic_u ( italic_τ , bold_italic_θ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ), where {𝜽n}subscript𝜽𝑛\{\bm{\theta}_{n}\}{ bold_italic_θ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } is a sequence in nsuperscript𝑛\mathbb{R}^{n}blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT converged to 𝜽𝜽\bm{\theta}bold_italic_θ. Heat equation in the combinatorial optimization exhibits two beneficial characteristics. Firstly, the propagation speed of heat is infinite [22], implying that the information can reach the solver instantaneously. Secondly, the location of the global minima does not change across time τ𝜏\tauitalic_τ, as demonstrated in the following theorem.

Theorem 1.

For any τ>0𝜏0\tau>0italic_τ > 0, the function u(τ,𝛉)𝑢𝜏𝛉u(\tau,\bm{\theta})italic_u ( italic_τ , bold_italic_θ ) and h(𝛉)𝛉h(\bm{\theta})italic_h ( bold_italic_θ ) has the same global minima in ¯nsuperscript¯𝑛\bar{\mathbb{R}}^{n}over¯ start_ARG blackboard_R end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT

argmin𝜽¯nu(τ,𝜽)=argmin𝜽¯nh(𝜽)subscript𝜽superscript¯𝑛𝑢𝜏𝜽subscript𝜽superscript¯𝑛𝜽\displaystyle{\arg\min}_{\bm{\theta}\in\bar{\mathbb{R}}^{n}}u(\tau,\bm{\theta}% )={\arg\min}_{\bm{\theta}\in\bar{\mathbb{R}}^{n}}h(\bm{\theta})roman_arg roman_min start_POSTSUBSCRIPT bold_italic_θ ∈ over¯ start_ARG blackboard_R end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_u ( italic_τ , bold_italic_θ ) = roman_arg roman_min start_POSTSUBSCRIPT bold_italic_θ ∈ over¯ start_ARG blackboard_R end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_h ( bold_italic_θ ) (7)

Consequentially, we can generalize the gradient descent approach Eq. (3) by substituting the function h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ) with u(τt,𝜽)𝑢subscript𝜏𝑡𝜽u(\tau_{t},\bm{\theta})italic_u ( italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_θ ) for different τt>0subscript𝜏𝑡0\tau_{t}>0italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT > 0 at different iteration step t𝑡titalic_t in Eq. (3), as follows

𝜽t+1=𝜽tγ𝜽u(τt,𝜽t),subscript𝜽𝑡1subscript𝜽subscript𝜽𝑡𝛾𝑢subscript𝜏𝑡subscript𝜽𝑡\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\gamma\bigtriangledown_{\bm{% \theta}}u(\tau_{t},\bm{\theta}_{t}),bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_u ( italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , (8)

where the subscript ’t’ in τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT means that τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT can vary across different steps. In this way, the solver can receive the gradient information about distant region of the landscape that is propagated by the heat diffusion, resulting in a more efficient navigation. However, Eq. (8) will converge to θˇi={+,si=+1,si=1subscriptˇ𝜃𝑖casessuperscriptsubscript𝑠𝑖1otherwisesuperscriptsubscript𝑠𝑖1otherwise\check{\theta}_{i}=\begin{cases}+\infty,\quad s_{i}^{\ast}=+1\\ -\infty,\quad s_{i}^{\ast}=-1\end{cases}overroman_ˇ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { start_ROW start_CELL + ∞ , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = + 1 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL - ∞ , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = - 1 end_CELL start_CELL end_CELL end_ROW, since 𝜽tsubscript𝜽𝑡\bm{\theta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are unbounded. To make the procedure Eq. (8) practicable, we project the 𝜽tsubscript𝜽𝑡\bm{\theta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT back to \mathcal{I}caligraphic_I after gradient descent at each iteration

𝜽t+1=Proj(𝜽tγ𝜽u(τt,𝜽t)),subscript𝜽𝑡1subscriptProjsubscript𝜽subscript𝜽𝑡𝛾𝑢subscript𝜏𝑡subscript𝜽𝑡\displaystyle\bm{\theta}_{t+1}=\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{% t}-\gamma\bigtriangledown_{\bm{\theta}}u(\tau_{t},\bm{\theta}_{t})\big{)},bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_u ( italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) , (9)

so that 𝜽tsubscript𝜽𝑡\bm{\theta}_{t}\in\mathcal{I}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ caligraphic_I always holds, where we define the projection as Proj(𝐱)i=min(1,max(0,xi))subscriptProjsubscript𝐱𝑖10subscript𝑥𝑖\mathrm{Proj}_{\mathcal{I}}(\mathbf{x})_{i}=\min(1,\max(0,x_{i}))roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( bold_x ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_min ( 1 , roman_max ( 0 , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) for i=1,,n𝑖1𝑛i=1,\ldots,nitalic_i = 1 , … , italic_n and for 𝐱n𝐱superscript𝑛\mathbf{x}\in\mathbb{R}^{n}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. Eq. (9) is a reasonable update rule for finding the minimum 𝜽superscript𝜽\bm{\theta}^{\ast}bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT within \mathcal{I}caligraphic_I for the following two reasons: (1) The projection of the 𝜽ˇˇ𝜽\check{\bm{\theta}}overroman_ˇ start_ARG bold_italic_θ end_ARG is the minimum of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ) in \mathcal{I}caligraphic_I, i.e., Proj(𝜽ˇ)=𝜽subscriptProjˇ𝜽superscript𝜽\mathrm{Proj}_{\mathcal{I}}\big{(}\check{\bm{\theta}}\big{)}=\bm{\theta}^{\ast}roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( overroman_ˇ start_ARG bold_italic_θ end_ARG ) = bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT; (2) Due to the property of the projection, if the solver moves towards 𝜽ˇˇ𝜽\check{\bm{\theta}}overroman_ˇ start_ARG bold_italic_θ end_ARG, it also gets closer to 𝜽superscript𝜽\bm{\theta}^{\ast}bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. More importantly, since the coordinates of 𝜽ˇˇ𝜽\check{\bm{\theta}}overroman_ˇ start_ARG bold_italic_θ end_ARG are all infinite, the convergent point of Eq. (9) must be one of the vertices of \mathcal{I}caligraphic_I, i.e., {0,1}nsuperscript01𝑛\{0,1\}^{n}{ 0 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. This suggests that Eq. (9) tends to give an output 𝜽𝜽\bm{\theta}bold_italic_θ that diminishes the uncertainty V(𝜽)𝑉𝜽V(\bm{\theta})italic_V ( bold_italic_θ ) (Eq. (4)).

3.2 Solving the heat equation

To develop an algorithm for solving the combinatorial optimization problem from Eq. (9), we must solve the heat equation Eq. (6), which seems a significant challenge when the dimension n𝑛nitalic_n is high. Fortunately, the solution has a closed form if the target function f(𝐬)𝑓𝐬f(\mathbf{s})italic_f ( bold_s ) can be written as a multi-linear polynomial of 𝐬𝐬\mathbf{s}bold_s

f(𝐬)=a0+i1a1,i1si1+i1<i2a2,i1i2si1s12++i1<<iKaK,i1iKsi1siK,𝑓𝐬subscript𝑎0subscriptsubscript𝑖1subscript𝑎1subscript𝑖1subscript𝑠subscript𝑖1subscriptsubscript𝑖1subscript𝑖2subscript𝑎2subscript𝑖1subscript𝑖2subscript𝑠subscript𝑖1subscript𝑠subscript12subscriptsubscript𝑖1subscript𝑖𝐾subscript𝑎𝐾subscript𝑖1subscript𝑖𝐾subscript𝑠subscript𝑖1subscript𝑠subscript𝑖𝐾\displaystyle\begin{aligned} f(\mathbf{s})=a_{0}+\sum_{i_{1}}a_{1,i_{1}}s_{i_{% 1}}+\sum_{i_{1}<i_{2}}a_{2,i_{1}i_{2}}s_{i_{1}}s_{1_{2}}+\cdots+\sum_{i_{1}<% \cdots<i_{K}}a_{K,i_{1}\ldots i_{K}}s_{i_{1}}\cdots s_{i_{K}},\end{aligned}start_ROW start_CELL italic_f ( bold_s ) = italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 1 , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 2 , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 1 start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ⋯ + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < ⋯ < italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_K , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT … italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋯ italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT , end_CELL end_ROW (10)

a condition met by a wide range of combinatorial optimization problems [16].

Theorem 2.

Supposed that f(𝐬)𝑓𝐬f(\mathbf{s})italic_f ( bold_s ) is a multilinear polynomial of 𝐬𝐬\mathbf{s}bold_s, then the solution to Eq. (6) is

u(τ,𝜽)=𝔼p(𝐱)[f(erf(𝜽𝐱τ))],𝐱Unif[0,1]n,formulae-sequence𝑢𝜏𝜽subscript𝔼𝑝𝐱delimited-[]𝑓erf𝜽𝐱𝜏𝐱Unifsuperscript01𝑛\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[f(\mathrm{erf}(% \frac{\bm{\theta}-\mathbf{x}}{\sqrt{\tau}}))],\quad\mathbf{x}\in\mathrm{Unif}[% 0,1]^{n},italic_u ( italic_τ , bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_f ( roman_erf ( divide start_ARG bold_italic_θ - bold_x end_ARG start_ARG square-root start_ARG italic_τ end_ARG end_ARG ) ) ] , bold_x ∈ roman_Unif [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , (11)

where erf(x)=2π0xet2𝑑terf𝑥2𝜋superscriptsubscript0𝑥superscript𝑒superscript𝑡2differential-d𝑡\mathrm{erf}(x)=\frac{2}{\sqrt{\pi}}\int_{0}^{x}e^{-t^{2}}dtroman_erf ( italic_x ) = divide start_ARG 2 end_ARG start_ARG square-root start_ARG italic_π end_ARG end_ARG ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_x end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT - italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_d italic_t is the error function.

For more general cases, we still use the approximation Eq. (11).

3.3 Proposed algorithm

Based on Eq. (9) and Thm. 2, we proposed Heat diffusion optimization (HeO), a gradient-based algorithm for combinatorial optimization, as illustrated in Alg. 1, where we estimate Eq. (11) with one sample 𝐱Unif[0,1]nsimilar-to𝐱Unifsuperscript01𝑛\mathbf{x}\sim\mathrm{Unif}[0,1]^{n}bold_x ∼ roman_Unif [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, and we denote τt=σtsubscript𝜏𝑡subscript𝜎𝑡\sqrt{\tau}_{t}=\sigma_{t}square-root start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT for short. Our HeO can be equipped with momentum, which is shown in Alg. 3 in Appendix. In contrast to those methods designed for solving special class of PBO (Eq. (1)) such as quadratic unconstrained binary optimization (QUBO), our HeO can directly solve PBO problems with general form. Although PBO can be represented as QUBO [30], this necessitates the introduction of auxiliary variables, which may consequently increase the problem size and leading to additional computational overhead [31]. Compared to other algorithms, our HeO has relatively low complexity. The most computationally intensive operation at each step is gradient calculation, which can be explicitly expressed or efficiently computed with tools like PyTorch’s autograd, and can be accelerated using GPUs. As shown in Fig. S1 of the in Appendix, the time cost per iteration of our methods increases linearly with the problem dimension, with a small constant coefficient. Since the overall computational time mainly depends on the number of iterations times the time cost per iteration, our HeO is efficient even in high-dimensional cases.

Algorithm 1 Heat diffusion optimization (HeO)
Input: target function f()𝑓f(\cdot)italic_f ( ⋅ ), step size γ𝛾\gammaitalic_γ, σ𝜎\sigmaitalic_σ schedule {σt}subscript𝜎𝑡\{\sigma_{t}\}{ italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, iteration number T𝑇Titalic_T
initialize elements of 𝜽0subscript𝜽0\bm{\theta}_{0}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 0.50.50.50.5
for t=0𝑡0t=0italic_t = 0 to T1𝑇1T-1italic_T - 1 do
     sample 𝐱tUnif[0,1]nsimilar-tosubscript𝐱𝑡Unifsuperscript01𝑛\mathbf{x}_{t}\sim\mathrm{Unif}[0,1]^{n}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ roman_Unif [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT
     𝐠t𝜽f(erf(𝜽t𝐱tσt))subscript𝐠𝑡subscript𝜽𝑓erfsubscript𝜽𝑡subscript𝐱𝑡subscript𝜎𝑡\mathbf{g}_{t}\leftarrow\bigtriangledown_{\bm{\theta}}f(\mathrm{erf}(\frac{\bm% {\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}))bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_f ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) )
     𝜽t+1Proj(𝜽tγ𝐠t)subscript𝜽𝑡1subscriptProjsubscript𝜽𝑡𝛾subscript𝐠𝑡\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \gamma\mathbf{g}_{t}\big{)}bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
end for
Output: binary configuration 𝐬T=sgn(𝜽T0.5)subscript𝐬𝑇sgnsubscript𝜽𝑇0.5\mathbf{s}_{T}=\mathrm{sgn}(\bm{\theta}_{T}-0.5)bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = roman_sgn ( bold_italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT - 0.5 )

One counter-intuitive thing is that to minimize the target function h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ), the HeO actually are minimizing different functions u(τt,𝜽)𝑢subscript𝜏𝑡𝜽u(\tau_{t},\bm{\theta})italic_u ( italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_θ ) at different step t𝑡titalic_t. We interpret this by providing an upper bound for the target optimization loss h(𝜽)h(𝜽)𝜽superscript𝜽h(\bm{\theta})-h(\bm{\theta}^{\ast})italic_h ( bold_italic_θ ) - italic_h ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ).

Theorem 3.

Denote f˘=max𝐬f(𝐬)˘𝑓subscript𝐬𝑓𝐬\breve{f}=\max_{\mathbf{s}}f(\mathbf{s})over˘ start_ARG italic_f end_ARG = roman_max start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT italic_f ( bold_s ). Given τ2>0subscript𝜏20\tau_{2}>0italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT > 0 and ϵ>0italic-ϵ0\epsilon>0italic_ϵ > 0, there exists τ1(0,τ2)subscript𝜏10subscript𝜏2\tau_{1}\in(0,\tau_{2})italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ ( 0 , italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), such that

h(𝜽)h(𝜽)[(f˘f)(u(τ2,𝜽)u(τ2,𝜽)+n2τ1τ2u(τ,𝜽)u(τ,𝜽)τ𝑑τ)]1/2+ϵ.𝜽superscript𝜽superscriptdelimited-[]˘𝑓superscript𝑓𝑢subscript𝜏2superscript𝜽𝑢subscript𝜏2𝜽𝑛2superscriptsubscriptsubscript𝜏1subscript𝜏2𝑢𝜏𝜽𝑢𝜏superscript𝜽𝜏differential-d𝜏12italic-ϵ\displaystyle\begin{aligned} h(\bm{\theta})-h(\bm{\theta}^{\ast})\leq\big{[}(% \breve{f}-f^{\ast})\big{(}u(\tau_{2},\bm{\theta}^{\ast})-u(\tau_{2},\bm{\theta% })+\frac{n}{2}\int_{\tau_{1}}^{\tau_{2}}\frac{u(\tau,\bm{\theta})-u(\tau,\bm{% \theta}^{\ast})}{\tau}d\tau\big{)}\big{]}^{1/2}+\epsilon.\end{aligned}start_ROW start_CELL italic_h ( bold_italic_θ ) - italic_h ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ≤ [ ( over˘ start_ARG italic_f end_ARG - italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ( italic_u ( italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_u ( italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , bold_italic_θ ) + divide start_ARG italic_n end_ARG start_ARG 2 end_ARG ∫ start_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT divide start_ARG italic_u ( italic_τ , bold_italic_θ ) - italic_u ( italic_τ , bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_τ end_ARG italic_d italic_τ ) ] start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT + italic_ϵ . end_CELL end_ROW (12)

Accordingly, minimizing u(τ,𝜽)𝑢𝜏𝜽u(\tau,\bm{\theta})italic_u ( italic_τ , bold_italic_θ ) for each τ𝜏\tauitalic_τ cooperatively aids in minimizing the original target function h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ). Thus, we refer to HeO as a cooperative optimization paradigm, as illustrated in Fig. 1.

4 Experiments

We apply our HeO to a variety of NP-hard combinatorial optimization problems to demonstrate its broad applicability. Unless explicitly stated otherwise, we employ the τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT schedule as τt=:σt=2(1t/T)\sqrt{\tau_{t}}=:\sigma_{t}=2(1-t/T)square-root start_ARG italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG = : italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 2 ( 1 - italic_t / italic_T ) for HeO throughout this work. The sensitivity of other parameter settings including the step size γ𝛾\gammaitalic_γ and iterations T𝑇Titalic_T are shown in Fig. S2 This choice is motivated by the idea that the reversed direction of heat flow guides the solver towards the original of its source, i.e., the global minima. Noticed that this choice is not theoretically necessary, as elaborated in Discussion.

Refer to caption
Figure 2: Performance of HeO (Alg. 3, Appendix), Monte Carlo gradient estimation (MCGE), Hopfield neural network (HNN) and simulated annealing (SA) on minimizing the output of a neural network (Eq. (13)). Top panel: the energy (f()𝑓f(\cdot)italic_f ( ⋅ )). Bottom panel: the uncertainty V(𝜽)𝑉𝜽V(\bm{\theta})italic_V ( bold_italic_θ ) (Eq. (4)).

Toy example. We consider the following target function

f(𝐬)=𝐚2Tsigmoid(W𝐬+𝐚1)𝑓𝐬superscriptsubscript𝐚2Tsigmoid𝑊𝐬subscript𝐚1\displaystyle f(\mathbf{s})=\mathbf{a}_{2}^{\mathrm{T}}\mathrm{sigmoid}(W% \mathbf{s}+\mathbf{a}_{1})italic_f ( bold_s ) = bold_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT roman_sigmoid ( italic_W bold_s + bold_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) (13)

where sigmoid(x)=11+exsigmoid𝑥11superscript𝑒𝑥\mathrm{sigmoid}(x)=\tfrac{1}{1+e^{-x}}roman_sigmoid ( italic_x ) = divide start_ARG 1 end_ARG start_ARG 1 + italic_e start_POSTSUPERSCRIPT - italic_x end_POSTSUPERSCRIPT end_ARG, and the elements of the network parameters 𝐚1nsubscript𝐚1superscript𝑛\mathbf{a}_{1}\in\mathbb{R}^{n}bold_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, 𝐚2msubscript𝐚2superscript𝑚\mathbf{a}_{2}\in\mathbb{R}^{m}bold_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, Wm,n𝑊superscript𝑚𝑛W\in\mathbb{R}^{m,n}italic_W ∈ blackboard_R start_POSTSUPERSCRIPT italic_m , italic_n end_POSTSUPERSCRIPT are uniformly sampled from [1,1]11[-1,1][ - 1 , 1 ] and fixed during optimizing. According to the universal approximation theory [32], f(𝐬)𝑓𝐬f(\mathbf{s})italic_f ( bold_s ) can approximate any continuous function with sufficiently large m𝑚mitalic_m, thereby representing a general target function. We compare the performance of our HeO with momentum (Alg. 3) against several representative methods: the conventional gradient-based solver MCGE [28] (with or without momentum), the simulated annealing [33], and the Hopfield neural network [34]. As shown in Fig. 2, our HeO demonstrates exceptional superiority over all other methods, and efficiently reduces its uncertainty compared to MCGE.

Refer to caption
Figure 3: a, Illustration of the max-cut problem. b, Performance of HeO (Alg. 1) and representative iterative approximation methods including LQA [10], aSB [12], bSB [13], dSB [13], CIM [35] and SIM-CIM [15] on max-cut problems from the Biq Mac Library [36]. Top panel: average relative loss for each algorithm over all problems. Bottom panel: the count of instances where each algorithm ended up with one of the bottom-2 worst results among the 7 algorithms.

Quadratic unconstrained binary optimization (QUBO). QUBO is the combinatorial optimization problem with quadratic form target function (Jn×n𝐽superscript𝑛𝑛J\in\mathbb{R}^{n\times n}italic_J ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT is a symmetric matrix with zero diagonals)

f(𝐬)=𝐬TJ𝐬.𝑓𝐬superscript𝐬T𝐽𝐬\displaystyle f(\mathbf{s})=\mathbf{s}^{\mathrm{T}}J\mathbf{s}.italic_f ( bold_s ) = bold_s start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT italic_J bold_s . (14)

This corresponds to the case where K=2𝐾2K=2italic_K = 2 in Eq. (10). A well-known class of QUBO is max-cut problem [27], in which we divide the vertices of a graph into two distinct subsets and aim to maximize the number of edges between them. Its target function is expressed as Eq. (14), where J𝐽Jitalic_J is determined by the adjacency matrix of the graph.

We compare our HeO with representative iterative approximation methods especially developed for solving QUBO including LQA [10], aSB [12], bSB [13], dSB [13], CIM [35], and SIM-CIM [15] on max-cut problems in the Biq Mac Library [36]111https://biqmac.aau.at/biqmaclib.html. We report the relative loss averaged over all instances and the count of the instances where each algorithm gives the bottom-2 worse result among the 7 algorithms. As shown in Fig. 3, our HeO is superior to other methods in terms of two metrics.

Refer to caption
Figure 4: a, Illustration of the Boolean 3-satisfiability (3-SAT) problem. b, Performance of HeO (Alg. 4, Appendix), 2-order and 3-order oscillation Ising machine (OIM) [16] on 3-SAT problems with various number of variables from the SATLIB [37]. We report the mean percent of constraints satisfied (left) and probability of satisfying all claims (right) for each algorithm.

Polynomial unconstrained binary optimization (PUBO). PUBO is a class of combinatorial optimization problems, in which higher-order interactions between bits sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT appears in the target function. Existing methods for solving PUBO fall into two categories: the first approach involves transforming PUBO into QUBO by adding auxiliary variables through a quadratization process, and then solving it as a QUBO problem [38], and the one directly solves PUBO [16]. Quadratization may dramatically increases the dimension of the problem, hence brings heavier computational overhead, while our HeO can be directly used for solving PUBO. A well-known class of PUBO is the Boolean 3333-satisfiability (3333-SAT) problem [27], which involves determining the satisfiability of a Boolean formula over n𝑛nitalic_n Boolean variables b1,,bnsubscript𝑏1subscript𝑏𝑛b_{1},\ldots,b_{n}italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_b start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT where bi{0,1}subscript𝑏𝑖01b_{i}\in\{0,1\}italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 0 , 1 }. The Boolean formula is structured in Conjunctive Normal Form (CNF) consisting of H𝐻Hitalic_H conjunction (\wedge) of clauses, and each clause hhitalic_h is a disjunction (\vee) of exactly three literals (either a Boolean variable or its negation). An algorithm of 3333-SAT aims to find the Boolean variables that makes as many as clauses satisfied.

To apply our HeO to the 3-SAT, we encode each Boolean variable bisubscript𝑏𝑖b_{i}italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which is assigned with value 1111 if bi=1subscript𝑏𝑖1b_{i}=1italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1, otherwise si=1subscript𝑠𝑖1s_{i}=-1italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = - 1. For a literal, we define a value chisubscript𝑐subscript𝑖c_{h_{i}}italic_c start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT, which is 11-1- 1 if the literal is the negation of the corresponding Boolean variable, otherwise it is 1111. Then finding the Boolean variables that satisfies as many as clauses is equivalent to minimize the target function

f(𝐬)=h=1Hi=131chishi2.𝑓𝐬superscriptsubscript1𝐻superscriptsubscriptproduct𝑖131subscript𝑐subscript𝑖subscript𝑠subscript𝑖2\displaystyle f(\mathbf{s})=\sum_{h=1}^{H}\prod_{i=1}^{3}\frac{1-c_{h_{i}}s_{h% _{i}}}{2}.italic_f ( bold_s ) = ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT divide start_ARG 1 - italic_c start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG . (15)

This corresponds to the case where K=3𝐾3K=3italic_K = 3 in Eq. (10). We compared our HeO (Alg. 4, Appendix) with the second-order oscillator Ising machines (OIM) solver that using quadratization and the state-of-art 3-order OIM proposed in [16] on 3-SAT instance in SATLIB 222https://www.cs.ubc.ca/~hoos/SATLIB/benchm.html. As shown in Fig. 4, our HeO is superior to other methods in attaining higher quality of solutions and finding more the complete satisfiable solution (solutions that satisfying all clauses). Notably, for the cases of 175-250 variables, our HeO is able to find more complete satisfiable solutions, compared to the 3-order OIM, while the 2-order OIM fails to find any complete solutions [16].

Refer to caption
Figure 5: a, Training a network with ternary-value parameters. b, The weight value accuracy of the HeO (Alg. 5, Appendix) and Monte Carlo gradient estimation (MCGE) with momentum under different sizes of training set. (n=100,m=1,5,20formulae-sequence𝑛100𝑚1520n=100,m=1,5,20italic_n = 100 , italic_m = 1 , 5 , 20). For each test, we estimate the mean from 10 runs.

Ternary optimization. Neural networks excel in learning and modeling complex functions, but they also bring about considerable computational demands due to their vast number of parameters. A promising strategy to mitigate this issue is quantization, which converts network parameters into discrete values [39]. However, directly training networks with discrete parameters introduces a significant challenge due to the high-dimensional combinatorial optimization problem it presents.

We apply our HeO and MCGE to directly train neural networks with ternary value (1,0,1101{-1,0,1}- 1 , 0 , 1). Supposed that we have an input-output dataset 𝒟𝒟\mathcal{D}caligraphic_D generated by a ground-truth ternary single-layer perceptron 𝐲=𝚪(𝐯;WGT)=Relu(WGT𝐯)𝐲𝚪𝐯subscript𝑊GTRelusubscript𝑊GT𝐯\mathbf{y}=\bm{\Gamma}(\mathbf{v};W_{\mathrm{GT}})=\mathrm{Relu}(W_{\mathrm{GT% }}\mathbf{v})bold_y = bold_Γ ( bold_v ; italic_W start_POSTSUBSCRIPT roman_GT end_POSTSUBSCRIPT ) = roman_Relu ( italic_W start_POSTSUBSCRIPT roman_GT end_POSTSUBSCRIPT bold_v ), where Relu(x)=max{0,x}Relu𝑥0𝑥\mathrm{Relu}(x)=\max\{0,x\}roman_Relu ( italic_x ) = roman_max { 0 , italic_x }, WGT{1,0,1}m×nsubscript𝑊GTsuperscript101𝑚𝑛W_{\mathrm{GT}}\in\{-1,0,1\}^{m\times n}italic_W start_POSTSUBSCRIPT roman_GT end_POSTSUBSCRIPT ∈ { - 1 , 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT is the ground-truth ternary weight parameter, 𝐯{1,0,1}n𝐯superscript101𝑛\mathbf{v}\in\{-1,0,1\}^{n}bold_v ∈ { - 1 , 0 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the input, and 𝐲𝐲\mathbf{y}bold_y is the model output. We aim to find the ternary configuration W{1,0,1}m×n𝑊superscript101𝑚𝑛W\in\{-1,0,1\}^{m\times n}italic_W ∈ { - 1 , 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT minimizing the loss MSE(W,𝒟)=1|𝒟|(𝐯,𝐲)𝒟𝚪(𝐯;W)𝐲2MSE𝑊𝒟1𝒟subscript𝐯𝐲𝒟superscriptnorm𝚪𝐯𝑊𝐲2\mathrm{MSE}(W,\mathcal{D})=\tfrac{1}{\left|\mathcal{D}\right|}\sum_{(\mathbf{% v},\mathbf{y})\in\mathcal{D}}\left\|\bm{\Gamma}(\mathbf{v};W)-\mathbf{y}\right% \|^{2}roman_MSE ( italic_W , caligraphic_D ) = divide start_ARG 1 end_ARG start_ARG | caligraphic_D | end_ARG ∑ start_POSTSUBSCRIPT ( bold_v , bold_y ) ∈ caligraphic_D end_POSTSUBSCRIPT ∥ bold_Γ ( bold_v ; italic_W ) - bold_y ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. We generalize our HeO from the binary to the ternary case by representing a ternary variable with two bits, where each element of W𝑊Witalic_W can be represented as a function of 𝐬𝐬\mathbf{s}bold_s (see Alg. 5, Appendix for details), and the target function is defined as

f(𝐬)=1|𝒟|(𝐯,𝐲)𝒟𝚪(𝐯;W(𝐬))𝐲2.𝑓𝐬1𝒟subscript𝐯𝐲𝒟superscriptnorm𝚪𝐯𝑊𝐬𝐲2\displaystyle f(\mathbf{s})=\frac{1}{\left|\mathcal{D}\right|}\sum_{(\mathbf{v% },\mathbf{y})\in\mathcal{D}}\left\|\bm{\Gamma}(\mathbf{v};W(\mathbf{s}))-% \mathbf{y}\right\|^{2}.italic_f ( bold_s ) = divide start_ARG 1 end_ARG start_ARG | caligraphic_D | end_ARG ∑ start_POSTSUBSCRIPT ( bold_v , bold_y ) ∈ caligraphic_D end_POSTSUBSCRIPT ∥ bold_Γ ( bold_v ; italic_W ( bold_s ) ) - bold_y ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (16)

As shown in Fig. 5, HeO robustly exceeds MCGE under different dataset size |𝒟|𝒟|\mathcal{D}|| caligraphic_D | and output size m𝑚mitalic_m.

Refer to caption
Figure 6: The variable selection of 400400400400-dimensional linear regressions using HeO (Alg. 6, Appendix), Lasso (L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT) regression [40] and L0.5subscript𝐿0.5L_{0.5}italic_L start_POSTSUBSCRIPT 0.5 end_POSTSUBSCRIPT regression [41]. We report the accuracy of each algorithm in determining whether each variable should be ignored for prediction and their MSE on the test set. The mean (dots) and standard deviation (bars) are estimated over 10 runs.

Mixed combinatorial optimization. In high-dimensional linear regression, usually only a small fraction of the variables significantly contribute to prediction. Identifying and selecting a subset of variables with strong predictive power—a process known as variable selection—is crucial, as it improves the generalizability and interpretability of the regression model [42]. However, direct variable selection is an NP-hard combinatorial optimization mixed with continuous variables [43]. As a practical alternative, regularization methods like Lasso algorithm are commonly employed [40].

Supposed a dataset is generated from a linear model, in which the relation between input 𝐯𝐯\mathbf{v}bold_v and output y𝑦yitalic_y is y=𝜷𝐯+ϵ𝑦superscript𝜷𝐯italic-ϵy=\bm{\beta}^{\ast}\cdot\mathbf{v}+\epsilonitalic_y = bold_italic_β start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ⋅ bold_v + italic_ϵ, where 𝜷superscript𝜷\bm{\beta}^{\ast}bold_italic_β start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the ground-truth linear coefficient and ϵitalic-ϵ\epsilonitalic_ϵ is independent Gaussian noise with standard deviation σϵsubscript𝜎italic-ϵ\sigma_{\epsilon}italic_σ start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT. Suppose that only a small proportion (denoted as q(0,1)𝑞01q\in(0,1)italic_q ∈ ( 0 , 1 )) of coordinates in βisuperscriptsubscript𝛽𝑖\beta_{i}^{\ast}italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT are non-zero. Our goal is to identify these coordinates through an indicator vector 𝐬{1,1}n𝐬superscript11𝑛\mathbf{s}\in\{-1,1\}^{n}bold_s ∈ { - 1 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT (1111 for selection and 11-1- 1 for non-selection) and estimate these non-zero coefficients. The target function of the problem is (𝟏n1superscript𝑛\mathbf{1}\in\mathbb{R}^{n}bold_1 ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the all-one vector)

f(𝐬,𝜷)=1|𝒟|(𝐯,𝐲)𝒟|(𝜷𝐬+𝟏2)𝐯𝐲|2.𝑓𝐬𝜷1𝒟subscript𝐯𝐲𝒟superscriptdirect-product𝜷𝐬12𝐯𝐲2\displaystyle f(\mathbf{s},\bm{\beta})=\frac{1}{\left|\mathcal{D}\right|}\sum_% {(\mathbf{v},\mathbf{y})\in\mathcal{D}}\left|\big{(}\bm{\beta}\odot\frac{% \mathbf{s}+\mathbf{1}}{2}\big{)}\cdot\mathbf{v}-\mathbf{y}\right|^{2}.italic_f ( bold_s , bold_italic_β ) = divide start_ARG 1 end_ARG start_ARG | caligraphic_D | end_ARG ∑ start_POSTSUBSCRIPT ( bold_v , bold_y ) ∈ caligraphic_D end_POSTSUBSCRIPT | ( bold_italic_β ⊙ divide start_ARG bold_s + bold_1 end_ARG start_ARG 2 end_ARG ) ⋅ bold_v - bold_y | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (17)

We solve the problem through HeO (Alg. 6, Appendix), where we minimize the loss relative to 𝜽𝜽\bm{\theta}bold_italic_θ while slowing varying 𝜷𝜷\bm{\beta}bold_italic_β via its error gradient. After obtaining the indicator 𝐬𝐬\mathbf{s}bold_s, we conduct an ordinary least squares regression on the variables selected by the 𝐬𝐬\mathbf{s}bold_s to estimate their coefficients, and treat other variables’ coefficients as zero. As shown in Fig.6, our HeO outperforms both Lasso regression and the more advanced L0.5subscript𝐿0.5L_{0.5}italic_L start_POSTSUBSCRIPT 0.5 end_POSTSUBSCRIPT regression [41] in terms of producing more accurate indicators 𝐬𝐬\mathbf{s}bold_s and achieving lower test prediction errors across various q𝑞qitalic_q and σϵsubscript𝜎italic-ϵ\sigma_{\epsilon}italic_σ start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT settings. Importantly, to give the variable selection prediction, our HeO does not need to know the level of q𝑞qitalic_q and σϵsubscript𝜎italic-ϵ\sigma_{\epsilon}italic_σ start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT in advance.

Refer to caption
Figure 7: The illustration of minimum vertex cover.
Table 1: Attributes of real world datasets and the vertex cover sizes of HeO (Alg. 7, Appendix) and FastVC.
Graph name Vertex number Edge number FastVC HeO
tech-RL-caida 190914 607610 78306 77372 (17)
soc-youtube 495957 1936748 149458 148875 (25)
inf-roadNet-PA 1087562 1541514 588306 587401 (104)
inf-roadNet-CA 1957027 2760388 1063352 1061339 (32)
socfb-B-anon 2937612 20959854 338724 312531 (194)
socfb-A-anon 3097165 23667394 421123 387730 (355)
socfb-uci-uni 58790782 92208195 869457 867863 (36)

Constrained binary optimization.

Minimum vertex cover (MVC) is a class of the constrained combinatorial optimization which has wide applications [44], as illustrated in Fig. 7. Given an undirected graph 𝒢𝒢\mathcal{G}caligraphic_G with vertex set 𝒱𝒱\mathcal{V}caligraphic_V and edge set \mathcal{E}caligraphic_E, the MVC is to find the minimum subset 𝒱c𝒱subscript𝒱𝑐𝒱\mathcal{V}_{c}\subset\mathcal{V}caligraphic_V start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ⊂ caligraphic_V, so that for each edge e𝑒e\in\mathcal{E}italic_e ∈ caligraphic_E, at least one of its endpoints belongs to 𝒱csubscript𝒱𝑐\mathcal{V}_{c}caligraphic_V start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. The target function and the constrains are expressed as

f(𝐬)=i=1nsi+12,subject to gij(𝐬)=(1si+12)(1sj+12)=0,i,j,eij.formulae-sequenceformulae-sequence𝑓𝐬superscriptsubscript𝑖1𝑛subscript𝑠𝑖12subject to subscript𝑔𝑖𝑗𝐬1subscript𝑠𝑖121subscript𝑠𝑗120for-all𝑖𝑗subscript𝑒𝑖𝑗\displaystyle f(\mathbf{s})=\sum_{i=1}^{n}\frac{s_{i}+1}{2},\quad\text{subject% to }g_{ij}(\mathbf{s})=(1-\frac{s_{i}+1}{2})(1-\frac{s_{j}+1}{2})=0,\quad% \forall i,j,e_{ij}\in\mathcal{E}.italic_f ( bold_s ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + 1 end_ARG start_ARG 2 end_ARG , subject to italic_g start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( bold_s ) = ( 1 - divide start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + 1 end_ARG start_ARG 2 end_ARG ) ( 1 - divide start_ARG italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + 1 end_ARG start_ARG 2 end_ARG ) = 0 , ∀ italic_i , italic_j , italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∈ caligraphic_E . (18)

We combine the HeO with penalty function (Alg. 7, Appendix) for solving MVC. We compare our HeO with FastVC [45], a powerful MVC heuristic algorithm, on massive real world graph datasets 333http://networkrepository.com/. For a fair comparison, we keep the run time of two algorithms as the same for each dataset. As shown in Tab. 1, our HeO can find smaller cover sets than that of FastVC.

5 Discussion

Existing model-based combinatorial optimization approaches encode the solution space via a parameterized distribution with iterative parameter updates [46]. In contrast to HeO, which requires only one sample per iteration, they necessitate a large number of samples per iteration. The Gibbs-With-Gradient algorithm [47] uses gradient information for combinatorial optimization but searches in the discrete solution space instead of the continuous one, as did HeO. Denoising diffusion model (DDM) [24] has been applied for solving combinatorial optimization problems [48]. Although the diffusion process in DDMs akin to the heat diffusion in our HeO, DDMs require a substantial data for training and necessitate reversing the diffusion process to generate data that from the target distribution. In contrast, HeO needs no training, and it is unnecessary to strictly adhering to the reverse time sequence τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in the optimization process, as under different τ𝜏\tauitalic_τ, the function u(τ,𝜽)𝑢𝜏𝜽u(\tau,\bm{\theta})italic_u ( italic_τ , bold_italic_θ ) shares the same optima with that of the original problem h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ). This claim is empirically corroborated in Fig. S3, Appendix, where HeO applying non-monotonic schedules of τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT still demonstrates superior performance.

Our HeO can be viewed as a stochastic Gaussian continuation (GC) method [49] with projection, which has been applied for non-convex optimization, though it has not yet been used for combinatorial optimization. The optimal convexification of GC [50] underpins potentially theoretical advantages of HeO. One key distinction is that GC typically optimizes each sub-problem (corresponding to u(𝜽,τt)𝑢𝜽subscript𝜏𝑡u(\bm{\theta},\tau_{t})italic_u ( bold_italic_θ , italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )) at each t𝑡titalic_t up to some criteria, whereas HeO merely performs a single-step gradient descent. Also, Eq. (8) corresponds to a variation of the evolution strategy (ES), a robust optimizer for non-differentiable function [51], while HeO use a different gradient estimation (Eq. (11)), see Appendix for details. Additionally, our HeO is related to randomized smoothing, which has been applied to non-smooth optimization [52] or neural network regularization [53]. The distinctive feature of our HeO is that, across different τ𝜏\tauitalic_τ, the smoothed function u(τ,𝜽)𝑢𝜏𝜽u(\tau,\bm{\theta})italic_u ( italic_τ , bold_italic_θ ) retains the optima of the original function h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ) (Thm. 1). This distinguishes our HeO from methods based on quantum adiabatic theory [9], bifurcation theory [12] and other relaxation strategies [54], in which the optima of the smoothed function can be different from the original one [27, 55]. This is empirically verified in Fig. S3, Appendix.

The heat equation in our HeO can be naturally extended to general parabolic differential equations, given that a broad spectrum of them obey the backward uniqueness [56]. For example, we can use τu(τ,𝜽)=𝜽[A𝜽u(τ,𝜽)]subscript𝜏𝑢𝜏𝜽subscript𝜽delimited-[]subscript𝜽𝐴𝑢𝜏𝜽\partial_{\tau}u(\tau,\bm{\theta})=\bigtriangledown_{\bm{\theta}}[A% \bigtriangledown_{\bm{\theta}}u(\tau,\bm{\theta})]∂ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_u ( italic_τ , bold_italic_θ ) = ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT [ italic_A ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_u ( italic_τ , bold_italic_θ ) ] to replace the Eq. (6), where A𝐴Aitalic_A is a real positive definite matrix. Prior researches have demonstrated that the optimization procedure can be significantly accelerated by preconditioning [57] or Fisher information matrix [58], implying that choosing a proper matrix A𝐴Aitalic_A could substantially improve the efficacy of the HeO. Additionally, since τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT does not necessarily have to be monotonic due to the cooperative optimization property of HeO, it is feasible to explore various τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT schedules (possibly non-monotonic) to further enhance performance. Moreover, our HeO can be integrated with other existing techniques for combinatorial optimization problems with constraints, such as Augmented Lagrangian methods, to achieve better performance [59].

Despite the effectiveness of HeO on the combinatorial optimization problems from different domains we have considered, it has limitations. First, current HeO is inefficient for integer linear programming and routing problems, primarily due to that it is cumbersome to encode integer variables through the Bernoulli distribution in our framework. Nevertheless, integrating HeO with other techniques such as advanced Metropolis-Hastings algorithm [60] may path to broaden the applicability of our methodology to a wider range of combinatorial optimization problems. Besides, our HeO allows for further customization by incorporating additional terms that integrate problem-specific prior knowledge or by hybridizing with other metaheuristic algorithms, allowing for more effective exploration of the configuration space. Second, our HeO can not be theoretically guaranteed for converging to the global minimum. In general, finding the global minimum is not theoretically guaranteed for non-convex optimization problems [61], such as the combinatorial optimization problems studied in this paper. However, it can be demonstrated that the gradient of the target function under heat diffusion satisfies the inequality [22]:

|𝜽u(τ,𝜽)|Cτ,subscript𝜽𝑢𝜏𝜽𝐶𝜏\displaystyle\left|\bigtriangledown_{\bm{\theta}}u(\tau,\bm{\theta})\right|% \leq\frac{C}{\sqrt{\tau}},| ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_u ( italic_τ , bold_italic_θ ) | ≤ divide start_ARG italic_C end_ARG start_ARG square-root start_ARG italic_τ end_ARG end_ARG ,

where the constant C𝐶Citalic_C depends on the dimension. This implies that the target function becomes weakly convex, enabling the finding of global minima and faster convergence under certain conditions [62].

6 Conclusion

In conclusion, grounded in the heat diffusion, we present a framework called Heat diffusion optimization (HeO) to solve various combinatorial optimization problems. The heat diffusion facilitates the propagation of information from distant regions to the solver, expanding its receptive field, which in turn enhances its ability to search for global optima. Demonstrating exceptional performance across various scenarios, our HeO highlights the potential of utilizing heat diffusion to address challenges associated with navigating the solution space of combinatorial optimization.

References

  • [1] Francisco Barahona, Martin Grötschel, Michael Jünger, and Gerhard Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988.
  • [2] Jun Wang, Tony Jebara, and Shih-Fu Chang. Semi-supervised learning using greedy max-cut. The Journal of Machine Learning Research, 14(1):771–800, 2013.
  • [3] Chetan Arora, Subhashis Banerjee, Prem Kalra, and SN Maheshwari. An efficient graph cut algorithm for computer vision problems. In Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part III 11, pages 552–565. Springer, 2010.
  • [4] Hayato Ushijima-Mwesigwa, Christian FA Negre, and Susan M Mniszewski. Graph partitioning using quantum annealing on the d-wave system. In Proceedings of the Second International Workshop on Post Moores Era Supercomputing, pages 22–29, 2017.
  • [5] Florian Neukart, Gabriele Compostella, Christian Seidel, David Von Dollen, Sheir Yarkoni, and Bob Parney. Traffic flow optimization using a quantum annealer. Frontiers in ICT, 4:29, 2017.
  • [6] Román Orús, Samuel Mugel, and Enrique Lizaso. Quantum computing for finance: Overview and prospects. Reviews in Physics, 4:100028, 2019.
  • [7] Rafael Marti and Gerhard Reinelt. Exact and Heuristic Methods in Combinatorial Optimization, volume 175. Springer, 2022.
  • [8] Edward Farhi, Jeffrey Goldstone, Sam Gutmann, Joshua Lapan, Andrew Lundgren, and Daniel Preda. A quantum adiabatic evolution algorithm applied to random instances of an np-complete problem. Science, 292(5516):472–475, 2001.
  • [9] John A Smolin and Graeme Smith. Classical signature of quantum annealing. Frontiers in physics, 2:52, 2014.
  • [10] Joseph Bowles, Alexandre Dauphin, Patrick Huembeli, José Martinez, and Antonio Acín. Quadratic unconstrained binary optimization via quantum-inspired annealing. Physical Review Applied, 18(3):034016, 2022.
  • [11] Mária Ercsey-Ravasz and Zoltán Toroczkai. Optimization hardness as transient chaos in an analog approach to constraint satisfaction. Nature Physics, 7(12):966–970, 2011.
  • [12] Hayato Goto, Kosuke Tatsumura, and Alexander R Dixon. Combinatorial optimization by simulating adiabatic bifurcations in nonlinear hamiltonian systems. Science advances, 5(4):eaav2372, 2019.
  • [13] Hayato Goto, Kotaro Endo, Masaru Suzuki, Yoshisato Sakai, Taro Kanao, Yohei Hamakawa, Ryo Hidaka, Masaya Yamasaki, and Kosuke Tatsumura. High-performance combinatorial optimization based on classical mechanics. Science Advances, 7(6):eabe7953, 2021.
  • [14] Takahiro Inagaki, Yoshitaka Haribara, Koji Igarashi, Tomohiro Sonobe, Shuhei Tamate, Toshimori Honjo, Alireza Marandi, Peter L McMahon, Takeshi Umeki, Koji Enbutsu, et al. A coherent ising machine for 2000-node optimization problems. Science, 354(6312):603–606, 2016.
  • [15] Egor S Tiunov, Alexander E Ulanov, and AI Lvovsky. Annealing by simulating the coherent ising machine. Optics express, 27(7):10288–10295, 2019.
  • [16] Connor Bybee, Denis Kleyko, Dmitri E Nikonov, Amir Khosrowshahi, Bruno A Olshausen, and Friedrich T Sommer. Efficient optimization with higher-order ising machines. Nature Communications, 14(1):6033, 2023.
  • [17] Martin JA Schuetz, J Kyle Brubaker, and Helmut G Katzgraber. Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 4(4):367–377, 2022.
  • [18] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, pages 1–3, 2023.
  • [19] David Pisinger and Stefan Ropke. Large neighborhood search. Handbook of metaheuristics, pages 99–127, 2019.
  • [20] Pierre Hansen, Nenad Mladenović, and Jose A Moreno Perez. Variable neighbourhood search: methods and applications. Annals of Operations Research, 175:367–407, 2010.
  • [21] Haoran Sun, Hanjun Dai, Wei Xia, and Arun Ramamurthy. Path auxiliary proposal for mcmc in discrete space. In International Conference on Learning Representations, 2021.
  • [22] Lawrence C Evans. Partial differential equations, volume 19. American Mathematical Society, 2022.
  • [23] Jean-Michel Ghidaglia. Some backward uniqueness results. Nonlinear Analysis: Theory, Methods & Applications, 10(8):777–790, 1986.
  • [24] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
  • [25] Endre Boros and Peter L Hammer. Pseudo-boolean optimization. Discrete applied mathematics, 123(1-3):155–225, 2002.
  • [26] Yves Crama and Peter L Hammer. Boolean functions: Theory, algorithms, and applications. Cambridge University Press, 2011.
  • [27] Bernhard H Korte, Jens Vygen, B Korte, and J Vygen. Combinatorial optimization, volume 1. Springer, 2011.
  • [28] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte carlo gradient estimation in machine learning. The Journal of Machine Learning Research, 21(1):5183–5244, 2020.
  • [29] David Vernon Widder. The heat equation, volume 67. Academic Press, 1976.
  • [30] Avradip Mandal, Arnab Roy, Sarvagya Upadhyay, and Hayato Ushijima-Mwesigwa. Compressed quadratization of higher order binary optimization problems. In Proceedings of the 17th ACM International Conference on Computing Frontiers, pages 126–131, 2020.
  • [31] Martin Anthony, Endre Boros, Yves Crama, and Aritanan Gruber. Quadratic reformulations of nonlinear binary optimization problems. Mathematical Programming, 162:115–144, 2017.
  • [32] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • [33] Michel Gendreau, Jean-Yves Potvin, et al. Handbook of metaheuristics, volume 2. Springer, 2010.
  • [34] John J Hopfield and David W Tank. “neural” computation of decisions in optimization problems. Biological cybernetics, 52(3):141–152, 1985.
  • [35] Zhe Wang, Alireza Marandi, Kai Wen, Robert L Byer, and Yoshihisa Yamamoto. Coherent ising machine based on degenerate optical parametric oscillators. Physical Review A, 88(6):063853, 2013.
  • [36] Angelika Wiegele. Biq mac library—a collection of max-cut and quadratic 0-1 programming instances of medium size. Preprint, 51, 2007.
  • [37] Holger H Hoos and Thomas Stützle. Satlib: An online resource for research on sat. Sat, 2000:283–292, 2000.
  • [38] Andrew Lucas. Ising formulations of many np problems. Frontiers in physics, 2:5, 2014.
  • [39] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
  • [40] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
  • [41] Zongben Xu, Hai Zhang, Yao Wang, XiangYu Chang, and Yong Liang. L 1/2 regularization. Science China Information Sciences, 53:1159–1169, 2010.
  • [42] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • [43] Xiaoping Li, Yadi Wang, and Rubén Ruiz. A survey on sparse learning models for feature selection. IEEE transactions on cybernetics, 52(3):1642–1660, 2020.
  • [44] Alexander C Reis, Sean M Halper, Grace E Vezeau, Daniel P Cetnar, Ayaan Hossain, Phillip R Clauer, and Howard M Salis. Simultaneous repression of multiple bacterial genes using nonrepetitive extra-long sgrna arrays. Nature biotechnology, 37(11):1294–1301, 2019.
  • [45] Shaowei Cai, Jinkun Lin, and Chuan Luo. Finding a small vertex cover in massive sparse graphs: Construct, local search, and preprocess. Journal of Artificial Intelligence Research, 59:463–494, 2017.
  • [46] Enlu Zhou and Jiaqiao Hu. Gradient-based adaptive stochastic search for non-differentiable optimization. IEEE Transactions on Automatic Control, 59(7):1818–1832, 2014.
  • [47] Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops i took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pages 3831–3841. PMLR, 2021.
  • [48] Zhiqing Sun and Yiming Yang. Difusco: Graph-based diffusion solvers for combinatorial optimization. Advances in neural information processing systems, 2023.
  • [49] Hossein Mobahi and John Fisher III. A theoretical analysis of optimization by gaussian continuation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • [50] Hossein Mobahi and John W Fisher. On the link between gaussian homotopy continuation and convex envelopes. In Energy Minimization Methods in Computer Vision and Pattern Recognition: 10th International Conference, EMMCVPR 2015, Hong Kong, China, January 13-16, 2015. Proceedings 10, pages 43–56. Springer, 2015.
  • [51] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(27):949–980, 2014.
  • [52] John C Duchi, Peter L Bartlett, and Martin J Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
  • [53] Mo Zhou, Tianyi Liu, Yan Li, Dachao Lin, Enlu Zhou, and Tuo Zhao. Toward understanding the importance of noise in training neural networks. In International Conference on Machine Learning, pages 7594–7602. PMLR, 2019.
  • [54] Nikolaos Karalias and Andreas Loukas. Erdos goes neural: an unsupervised learning framework for combinatorial optimization on graphs. Advances in Neural Information Processing Systems, 33:6659–6672, 2020.
  • [55] Haoyu Peter Wang, Nan Wu, Hang Yang, Cong Hao, and Pan Li. Unsupervised learning for combinatorial optimization with principled objective relaxation. Advances in Neural Information Processing Systems, 35:31444–31458, 2022.
  • [56] Jie Wu and Liqun Zhang. Backward uniqueness for general parabolic operators in the whole space. Calculus of Variations and Partial Differential Equations, 58:1–19, 2019.
  • [57] Hengyuan Ma, Li Zhang, Xiatian Zhu, and Jianfeng Feng. Accelerating score-based generative models with preconditioned diffusion sampling. In European Conference on Computer Vision, pages 1–16. Springer, 2022.
  • [58] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
  • [59] Ernesto G Birgin and José Mario Martínez. Practical augmented Lagrangian methods for constrained optimization. SIAM, 2014.
  • [60] Haoran Sun, Katayoon Goshvadi, Azade Nova, Dale Schuurmans, and Hanjun Dai. Revisiting sampling for combinatorial optimization. In International Conference on Machine Learning, pages 32859–32874. PMLR, 2023.
  • [61] Feng-Yi Liao, Lijun Ding, and Yang Zheng. Error bounds, pl condition, and quadratic growth for weakly convex functions, and linear convergences of proximal point methods. In 6th Annual Learning for Dynamics & Control Conference, pages 993–1005. PMLR, 2024.
  • [62] Felipe Atenas, Claudia Sagastizábal, Paulo JS Silva, and Mikhail Solodov. A unified analysis of descent sequences in weakly convex optimization, including convergence rates for bundle methods. SIAM Journal on Optimization, 33(1):89–115, 2023.
  • [63] Paul Bertens and Seong-Whan Lee. Network of evolvable neural units: Evolving to learn at a synaptic level. arXiv preprint, 2019.
  • [64] Juntao Wang, Daniel Ebler, KY Michael Wong, David Shui Wing Hui, and Jie Sun. Bifurcation behaviors shape how continuous physical dynamics solves discrete ising optimization. Nature Communications, 14(1):2510, 2023.

Supplementary Information

1 Proof of theorems

1.1 Proof of Thm. 1

To prove the Thm. 1, we recall the backward uniqueness of the heat equation [23], which asserts that the initial state of a heat equation can be uniquely determined by its state at a time point τ𝜏\tauitalic_τ under mild conditions, as shown in the following theorem.

Theorem S4.

Given two bounded function h1(𝐱),h2(𝐱)subscript1𝐱subscript2𝐱h_{1}(\mathbf{x}),h_{2}(\mathbf{x})italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_x ) , italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_x ) with domain on nsuperscript𝑛\mathbb{R}^{n}blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. Denote u1(τ,𝐱)subscript𝑢1𝜏𝐱u_{1}(\tau,\mathbf{x})italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_τ , bold_x ) and u2(τ,𝐱)subscript𝑢2𝜏𝐱u_{2}(\tau,\mathbf{x})italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_τ , bold_x ) as the solutions to the heat equation (Eq. (6)) with initial condition u1(0,𝐱)=h1(𝐱)subscript𝑢10𝐱subscript1𝐱u_{1}(0,\mathbf{x})=h_{1}(\mathbf{x})italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 0 , bold_x ) = italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_x ) and u2(0,𝐱)=h2(𝐱)subscript𝑢20𝐱subscript2𝐱u_{2}(0,\mathbf{x})=h_{2}(\mathbf{x})italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( 0 , bold_x ) = italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_x ) respectively. If there exists a τ>0𝜏0\tau>0italic_τ > 0, such that

u1(τ,𝐱)=u2(τ,𝐱),𝐱n,formulae-sequencesubscript𝑢1𝜏𝐱subscript𝑢2𝜏𝐱for-all𝐱superscript𝑛\displaystyle u_{1}(\tau,\mathbf{x})=u_{2}(\tau,\mathbf{x}),\quad\forall% \mathbf{x}\in\mathbb{R}^{n},italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_τ , bold_x ) = italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_τ , bold_x ) , ∀ bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , (S1)

we have h1=h2subscript1subscript2h_{1}=h_{2}italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

Proof of the Thm. 1.

Proof.

We first show that the global minima of u(τ,𝜽)𝑢𝜏𝜽u(\tau,\bm{\theta})italic_u ( italic_τ , bold_italic_θ ) is also the global minima of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ). The cornerstone of the proof is Thm. S4. To utilize the backward uniqueness, we consider a reparameterization of p(𝐬|𝜽)𝑝conditional𝐬𝜽p(\mathbf{s}|\bm{\theta})italic_p ( bold_s | bold_italic_θ ) by introducing a random variable 𝐱Unif(0,1)nsimilar-to𝐱Unifsuperscript01𝑛\mathbf{x}\sim\mathrm{Unif}(0,1)^{n}bold_x ∼ roman_Unif ( 0 , 1 ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. It is easy to see that sgn(θixi)sgnsubscript𝜃𝑖subscript𝑥𝑖\mathrm{sgn}(\theta_{i}-x_{i})roman_sgn ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) obeys Bernoulli distribution with probability θisubscript𝜃𝑖\theta_{i}italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to be 1111 and 1θi1subscript𝜃𝑖1-\theta_{i}1 - italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to be 11-1- 1. Replacing sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT by xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in Eq. (2), we have a new expression of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ )

h(𝜽)=𝔼p(𝐱)[f(sgn(𝜽𝐱))],𝐱Unif(0,1)n.formulae-sequence𝜽subscript𝔼𝑝𝐱delimited-[]𝑓sgn𝜽𝐱similar-to𝐱Unifsuperscript01𝑛\displaystyle h(\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[f(\mathrm{sgn}(\bm{% \theta}-\mathbf{x}))],\quad\mathbf{x}\sim\mathrm{Unif}(0,1)^{n}.italic_h ( bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) ] , bold_x ∼ roman_Unif ( 0 , 1 ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT . (S2)

Since the solution to the heat equation can be represented by the heat kernel, we have [22]

u(τ,𝜽)=𝔼p(𝐳)[h(𝜽+2τ𝐳)],p(𝐳)=𝒩(𝟎,I).formulae-sequence𝑢𝜏𝜽subscript𝔼𝑝𝐳delimited-[]𝜽2𝜏𝐳𝑝𝐳𝒩0𝐼\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{z})}[h(\bm{\theta}+% \sqrt{2\tau}\mathbf{z})],\quad p(\mathbf{z})=\mathcal{N}(\mathbf{0},I).italic_u ( italic_τ , bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_h ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z ) ] , italic_p ( bold_z ) = caligraphic_N ( bold_0 , italic_I ) . (S3)

Therefore, we have

u(τ,𝜽)=𝔼p(𝐳)[𝔼p(𝐱)[f(sgn(𝜽+2τ𝐳𝐱)]]=𝔼p(𝐱)[𝔼p(𝐳)[f(sgn(𝜽(𝐱+2τ𝐳))]].\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{z})}[\mathbb{E}_{p(% \mathbf{x})}[f(\mathrm{sgn}(\bm{\theta}+\sqrt{2\tau}\mathbf{z}-\mathbf{x})]]=% \mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(\mathbf{z})}[f(\mathrm{sgn}(\bm{% \theta}-(\mathbf{x}+\sqrt{2\tau}\mathbf{z}))]].italic_u ( italic_τ , bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z - bold_x ) ] ] = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ - ( bold_x + square-root start_ARG 2 italic_τ end_ARG bold_z ) ) ] ] . (S4)

Denote u(τ,𝐱;𝜽)=𝔼p(𝐳)[f(sgn(𝜽(𝐱+2τ𝐳)))]𝑢𝜏𝐱𝜽subscript𝔼𝑝𝐳delimited-[]𝑓sgn𝜽𝐱2𝜏𝐳u(\tau,\mathbf{x};\bm{\theta})=\mathbb{E}_{p(\mathbf{z})}[f(\mathrm{sgn}(\bm{% \theta}-(\mathbf{x}+\sqrt{2\tau}\mathbf{z})))]italic_u ( italic_τ , bold_x ; bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ - ( bold_x + square-root start_ARG 2 italic_τ end_ARG bold_z ) ) ) ], 𝐱(0,1)n𝐱superscript01𝑛\mathbf{x}\in(0,1)^{n}bold_x ∈ ( 0 , 1 ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. Noticed that u(τ,𝐱;𝜽)𝑢𝜏𝐱𝜽u(\tau,\mathbf{x};\bm{\theta})italic_u ( italic_τ , bold_x ; bold_italic_θ ) is the solution of the following unbounded heat equation respect to the time τ𝜏\tauitalic_τ and location 𝐱𝐱\mathbf{x}bold_x restricted on the region 𝐱(0,1)n𝐱superscript01𝑛\mathbf{x}\in(0,1)^{n}bold_x ∈ ( 0 , 1 ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT

{τu(τ,𝐱;𝜽)=Δ𝐱u(τ,𝐱;𝜽),τ>0,𝐱nu(τ,𝐱;𝜽)=f(sgn(𝜽𝐱)),τ=0,𝐱n,\displaystyle\left\{\begin{matrix}\partial_{\tau}u(\tau,\mathbf{x};\bm{\theta}% )&=&\Delta_{\mathbf{x}}u(\tau,\mathbf{x};\bm{\theta}),&\quad\tau>0,&\quad% \mathbf{x}\in\mathbb{R}^{n}\\ u(\tau,\mathbf{x};\bm{\theta})&=&f(\mathrm{sgn}(\bm{\theta}-\mathbf{x})),&% \quad\tau=0,&\quad\mathbf{x}\in\mathbb{R}^{n}\end{matrix}\right.,{ start_ARG start_ROW start_CELL ∂ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_u ( italic_τ , bold_x ; bold_italic_θ ) end_CELL start_CELL = end_CELL start_CELL roman_Δ start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT italic_u ( italic_τ , bold_x ; bold_italic_θ ) , end_CELL start_CELL italic_τ > 0 , end_CELL start_CELL bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_u ( italic_τ , bold_x ; bold_italic_θ ) end_CELL start_CELL = end_CELL start_CELL italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) , end_CELL start_CELL italic_τ = 0 , end_CELL start_CELL bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG , (S5)

hence the latter can be considered as an extension of the former. Since u(τ,𝐱;𝜽)𝑢𝜏𝐱𝜽u(\tau,\mathbf{x};\bm{\theta})italic_u ( italic_τ , bold_x ; bold_italic_θ ) is analytic respect to 𝐱(0,1)n𝐱superscript01𝑛\mathbf{x}\in(0,1)^{n}bold_x ∈ ( 0 , 1 ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT for τ>0𝜏0\tau>0italic_τ > 0, this extension is unique. Therefore, the value of u(τ,𝐱;𝜽)𝑢𝜏𝐱𝜽u(\tau,\mathbf{x};\bm{\theta})italic_u ( italic_τ , bold_x ; bold_italic_θ ) on (0,1)nsuperscript01𝑛(0,1)^{n}( 0 , 1 ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, τ>0𝜏0\tau>0italic_τ > 0 uniquely determines the solution of Eq. (S5). Denote 𝜽ˇˇ𝜽\check{\bm{\theta}}overroman_ˇ start_ARG bold_italic_θ end_ARG as

θˇi={+,si=+1,si=1.subscriptˇ𝜃𝑖casessuperscriptsubscript𝑠𝑖1otherwisesuperscriptsubscript𝑠𝑖1otherwise\displaystyle\check{\theta}_{i}=\begin{cases}+\infty,\quad s_{i}^{\ast}=+1\\ -\infty,\quad s_{i}^{\ast}=-1.\end{cases}overroman_ˇ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { start_ROW start_CELL + ∞ , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = + 1 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL - ∞ , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = - 1 . end_CELL start_CELL end_CELL end_ROW (S6)

Then u(τ,𝐱;𝜽ˇ)=f𝑢𝜏𝐱ˇ𝜽superscript𝑓u(\tau,\mathbf{x};\check{\bm{\theta}})=f^{\ast}italic_u ( italic_τ , bold_x ; overroman_ˇ start_ARG bold_italic_θ end_ARG ) = italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, for any τ0𝜏0\tau\geq 0italic_τ ≥ 0 and 𝐱n𝐱superscript𝑛\mathbf{x}\in\mathbb{R}^{n}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, and we have

u(τ,𝜽ˇ)=𝔼p(𝐱)[u(τ,𝐱;𝜽ˇ)]=f.𝑢𝜏ˇ𝜽subscript𝔼𝑝𝐱delimited-[]𝑢𝜏𝐱ˇ𝜽superscript𝑓\displaystyle u(\tau,\check{\bm{\theta}})=\mathbb{E}_{p(\mathbf{x})}[u(\tau,% \mathbf{x};\check{\bm{\theta}})]=f^{\ast}.italic_u ( italic_τ , overroman_ˇ start_ARG bold_italic_θ end_ARG ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_u ( italic_τ , bold_x ; overroman_ˇ start_ARG bold_italic_θ end_ARG ) ] = italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT . (S7)

Noticed that for 𝜽¯n𝜽superscript¯𝑛\bm{\theta}\in\bar{\mathbb{R}}^{n}bold_italic_θ ∈ over¯ start_ARG blackboard_R end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, we have

u(τ,𝜽)=𝔼p(𝐱)[u(τ,𝐱;𝜽)]=𝔼p(𝐱)[𝔼p(𝐳)[f(sgn(𝜽(𝐱+2τ𝐳)))]]𝔼p(𝐱)[𝔼p(𝐳)[f]]=f,𝑢𝜏𝜽subscript𝔼𝑝𝐱delimited-[]𝑢𝜏𝐱𝜽subscript𝔼𝑝𝐱delimited-[]subscript𝔼𝑝𝐳delimited-[]𝑓sgn𝜽𝐱2𝜏𝐳subscript𝔼𝑝𝐱delimited-[]subscript𝔼𝑝𝐳delimited-[]superscript𝑓superscript𝑓\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[u(\tau,\mathbf{x}% ;\bm{\theta})]=\mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(\mathbf{z})}[f(\mathrm% {sgn}(\bm{\theta}-(\mathbf{x}+\sqrt{2\tau}\mathbf{z})))]]\leq\mathbb{E}_{p(% \mathbf{x})}[\mathbb{E}_{p(\mathbf{z})}[f^{\ast}]]=f^{\ast},italic_u ( italic_τ , bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_u ( italic_τ , bold_x ; bold_italic_θ ) ] = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ - ( bold_x + square-root start_ARG 2 italic_τ end_ARG bold_z ) ) ) ] ] ≤ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ] ] = italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , (S8)

and the equality is true if and only if u(τ,𝐱;𝜽)=f𝑢𝜏𝐱𝜽superscript𝑓u(\tau,\mathbf{x};{\bm{\theta}})=f^{\ast}italic_u ( italic_τ , bold_x ; bold_italic_θ ) = italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is true for 𝐱n𝐱superscript𝑛\mathbf{x}\in\mathbb{R}^{n}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. Therefore, if 𝜽^^𝜽\hat{\bm{\theta}}over^ start_ARG bold_italic_θ end_ARG is the one of the minimas of u(τ,𝜽)𝑢𝜏𝜽u(\tau,{\bm{\theta}})italic_u ( italic_τ , bold_italic_θ ), and we have u(τ,𝜽^)f𝑢𝜏^𝜽superscript𝑓u(\tau,\hat{\bm{\theta}})\geq f^{\ast}italic_u ( italic_τ , over^ start_ARG bold_italic_θ end_ARG ) ≥ italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. Similarly, since

u(τ,𝜽^)=𝔼p(𝐱)[u(τ,𝐱;𝜽^)]=𝔼p(𝐱)[𝔼p(𝐳)[f(sgn(𝜽^(𝐱+2τ𝐳)))]]𝔼p(𝐱)[𝔼p(𝐳)[f]]=f,𝑢𝜏^𝜽subscript𝔼𝑝𝐱delimited-[]𝑢𝜏𝐱^𝜽subscript𝔼𝑝𝐱delimited-[]subscript𝔼𝑝𝐳delimited-[]𝑓sgn^𝜽𝐱2𝜏𝐳subscript𝔼𝑝𝐱delimited-[]subscript𝔼𝑝𝐳delimited-[]superscript𝑓superscript𝑓\displaystyle u(\tau,\hat{\bm{\theta}})=\mathbb{E}_{p(\mathbf{x})}[u(\tau,% \mathbf{x};\hat{\bm{\theta}})]=\mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(% \mathbf{z})}[f(\mathrm{sgn}(\hat{\bm{\theta}}-(\mathbf{x}+\sqrt{2\tau}\mathbf{% z})))]]\leq\mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(\mathbf{z})}[f^{\ast}]]=f^% {\ast},italic_u ( italic_τ , over^ start_ARG bold_italic_θ end_ARG ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_u ( italic_τ , bold_x ; over^ start_ARG bold_italic_θ end_ARG ) ] = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( over^ start_ARG bold_italic_θ end_ARG - ( bold_x + square-root start_ARG 2 italic_τ end_ARG bold_z ) ) ) ] ] ≤ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ] ] = italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , (S9)

we also have u(τ,𝜽^)f𝑢𝜏^𝜽superscript𝑓u(\tau,\hat{\bm{\theta}})\leq f^{\ast}italic_u ( italic_τ , over^ start_ARG bold_italic_θ end_ARG ) ≤ italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, hence

u(τ,𝐱;𝜽^)=f=u(τ,𝐱;𝜽),𝐱n.formulae-sequence𝑢𝜏𝐱^𝜽superscript𝑓𝑢𝜏𝐱superscript𝜽𝐱superscript𝑛\displaystyle u(\tau,\mathbf{x};\hat{\bm{\theta}})=f^{\ast}=u(\tau,\mathbf{x};% {\bm{\theta}}^{\ast}),\quad\mathbf{x}\in\mathbb{R}^{n}.italic_u ( italic_τ , bold_x ; over^ start_ARG bold_italic_θ end_ARG ) = italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = italic_u ( italic_τ , bold_x ; bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) , bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT . (S10)

Due to the backward uniqueness of the heat equation, we have

u(0,𝐱;𝜽^)=u(0,𝐱;𝜽),𝐱n,formulae-sequence𝑢0𝐱^𝜽𝑢0𝐱superscript𝜽𝐱superscript𝑛\displaystyle u(0,\mathbf{x};\hat{\bm{\theta}})=u(0,\mathbf{x};{\bm{\theta}}^{% \ast}),\quad\mathbf{x}\in\mathbb{R}^{n},italic_u ( 0 , bold_x ; over^ start_ARG bold_italic_θ end_ARG ) = italic_u ( 0 , bold_x ; bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) , bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , (S11)

that is

h(𝜽^)=h(𝜽)=f.^𝜽superscript𝜽superscript𝑓\displaystyle h(\hat{\bm{\theta}})=h({\bm{\theta}}^{\ast})=f^{\ast}.italic_h ( over^ start_ARG bold_italic_θ end_ARG ) = italic_h ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) = italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT . (S12)

As a result, 𝜽^^𝜽\hat{\bm{\theta}}over^ start_ARG bold_italic_θ end_ARG is the one of minimas of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ). Conversely, using Eq. (S7), it is obviously to see that if 𝜽^^𝜽\hat{\bm{\theta}}over^ start_ARG bold_italic_θ end_ARG is one of minimas of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ), it is also one of minimas of u(τ,𝜽)𝑢𝜏𝜽u(\tau,{\bm{\theta}})italic_u ( italic_τ , bold_italic_θ ). ∎

1.2 Proof of Thm. 2

Recall Eq. (S3) and use the definition of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ) (see Sec. 2 of the main paper), we have

u(τ,𝜽)=𝔼p(𝐳)[𝔼p(𝐬|𝜽+𝟐τ𝐳)[f(𝐬)]].𝑢𝜏𝜽subscript𝔼𝑝𝐳delimited-[]subscript𝔼𝑝conditional𝐬𝜽2𝜏𝐳delimited-[]𝑓𝐬\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{z})}[\mathbb{E}_{p(% \mathbf{s|\bm{\theta}+\sqrt{2\tau}\mathbf{z}})}[f(\mathbf{s})]].italic_u ( italic_τ , bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_s | bold_italic_θ + square-root start_ARG bold_2 italic_τ end_ARG bold_z ) end_POSTSUBSCRIPT [ italic_f ( bold_s ) ] ] . (S13)

To construct a low-variance estimation, instead of directly using the Monte Carlo gradient estimation by sampling from p(𝐳)𝑝𝐳p(\mathbf{z})italic_p ( bold_z ) and p(𝐬|𝜽+𝟐τ𝐳)𝑝conditional𝐬𝜽2𝜏𝐳p(\mathbf{s|\bm{\theta}+\sqrt{2\tau}\mathbf{z}})italic_p ( bold_s | bold_italic_θ + square-root start_ARG bold_2 italic_τ end_ARG bold_z ) for gradient estimation, we manage to integrate out the stochasticity respect to 𝐳𝐳\mathbf{z}bold_z. Use the reparameterization Eq. (S2) and Eq. (S3), we have

u(τ,𝜽)=𝔼p(𝐱)[𝔼p(𝐳)[f(sgn(𝜽+2τ𝐳𝐱))]].𝑢𝜏𝜽subscript𝔼𝑝𝐱delimited-[]subscript𝔼𝑝𝐳delimited-[]𝑓sgn𝜽2𝜏𝐳𝐱\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[\mathbb{E}_{p(% \mathbf{z})}[f(\mathrm{sgn}(\bm{\theta}+\sqrt{2\tau}\mathbf{z}-\mathbf{x}))]].italic_u ( italic_τ , bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z - bold_x ) ) ] ] . (S14)

Now we can calculate the inner term 𝔼p(𝐳)[f(sgn(𝜽+2τ𝐳𝐱))]subscript𝔼𝑝𝐳delimited-[]𝑓sgn𝜽2𝜏𝐳𝐱\mathbb{E}_{p(\mathbf{z})}[f(\mathrm{sgn}(\bm{\theta}+\sqrt{2\tau}\mathbf{z}-% \mathbf{x}))]blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z - bold_x ) ) ]. Due to the assumption, the target function f(𝐬)𝑓𝐬f(\mathbf{s})italic_f ( bold_s ) can be written as a K𝐾Kitalic_K-order multilinear polynomial of 𝐬𝐬\mathbf{s}bold_s

f(𝐬)=a0+i1a1,i1si1+i1<i2a2,i1i2si1s12+i1<i2<i3a3,i1i2i3si1s12si3++i1<<iKaK,i1iKsi1siK.missing-subexpression𝑓𝐬subscript𝑎0subscriptsubscript𝑖1subscript𝑎1subscript𝑖1subscript𝑠subscript𝑖1subscriptsubscript𝑖1subscript𝑖2subscript𝑎2subscript𝑖1subscript𝑖2subscript𝑠subscript𝑖1subscript𝑠subscript12subscriptsubscript𝑖1subscript𝑖2subscript𝑖3subscript𝑎3subscript𝑖1subscript𝑖2subscript𝑖3subscript𝑠subscript𝑖1subscript𝑠subscript12subscript𝑠subscript𝑖3missing-subexpressionsubscriptsubscript𝑖1subscript𝑖𝐾subscript𝑎𝐾subscript𝑖1subscript𝑖𝐾subscript𝑠subscript𝑖1subscript𝑠subscript𝑖𝐾\displaystyle\begin{aligned} &f(\mathbf{s})=a_{0}+\sum_{i_{1}}a_{1,i_{1}}s_{i_% {1}}+\sum_{i_{1}<i_{2}}a_{2,i_{1}i_{2}}s_{i_{1}}s_{1_{2}}+\sum_{i_{1}<i_{2}<i_% {3}}a_{3,i_{1}i_{2}i_{3}}s_{i_{1}}s_{1_{2}}s_{i_{3}}+\cdots\\ &+\sum_{i_{1}<\cdots<i_{K}}a_{K,i_{1}\ldots i_{K}}s_{i_{1}}\cdots s_{i_{K}}.% \end{aligned}start_ROW start_CELL end_CELL start_CELL italic_f ( bold_s ) = italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 1 , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 2 , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 1 start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT < italic_i start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 3 , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 1 start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ⋯ end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < ⋯ < italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_K , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT … italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋯ italic_s start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT . end_CELL end_ROW (S15)

Integrating respect to each dimension of the Gaussian integral, we have [50]

𝔼p(𝐳)[f(sgn(𝜽+2τ𝐳𝐱))]=a0+i1a1,i1s~i1+i1<i2a2,i1i2s~i1s~12+i1<i2<i3a3,i1i2i3s~i1s~12s~i3++i1<<iKaK,i1iKs~i1s~iK,missing-subexpressionsubscript𝔼𝑝𝐳delimited-[]𝑓sgn𝜽2𝜏𝐳𝐱missing-subexpressionabsentsubscript𝑎0subscriptsubscript𝑖1subscript𝑎1subscript𝑖1subscript~𝑠subscript𝑖1subscriptsubscript𝑖1subscript𝑖2subscript𝑎2subscript𝑖1subscript𝑖2subscript~𝑠subscript𝑖1subscript~𝑠subscript12subscriptsubscript𝑖1subscript𝑖2subscript𝑖3subscript𝑎3subscript𝑖1subscript𝑖2subscript𝑖3subscript~𝑠subscript𝑖1subscript~𝑠subscript12subscript~𝑠subscript𝑖3missing-subexpressionsubscriptsubscript𝑖1subscript𝑖𝐾subscript𝑎𝐾subscript𝑖1subscript𝑖𝐾subscript~𝑠subscript𝑖1subscript~𝑠subscript𝑖𝐾\displaystyle\begin{aligned} &\mathbb{E}_{p(\mathbf{z})}[f(\mathrm{sgn}(\bm{% \theta}+\sqrt{2\tau}\mathbf{z}-\mathbf{x}))]\\ &=a_{0}+\sum_{i_{1}}a_{1,i_{1}}\tilde{s}_{i_{1}}+\sum_{i_{1}<i_{2}}a_{2,i_{1}i% _{2}}\tilde{s}_{i_{1}}\tilde{s}_{1_{2}}+\sum_{i_{1}<i_{2}<i_{3}}a_{3,i_{1}i_{2% }i_{3}}\tilde{s}_{i_{1}}\tilde{s}_{1_{2}}\tilde{s}_{i_{3}}\\ &+\cdots+\sum_{i_{1}<\cdots<i_{K}}a_{K,i_{1}\ldots i_{K}}\tilde{s}_{i_{1}}% \cdots\tilde{s}_{i_{K}},\end{aligned}start_ROW start_CELL end_CELL start_CELL blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z - bold_x ) ) ] end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 1 , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 2 , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT 1 start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT < italic_i start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 3 , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT 1 start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ⋯ + ∑ start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < ⋯ < italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_K , italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT … italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋯ over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT , end_CELL end_ROW (S16)

where

s~i=𝔼p(zi)[sgn(θi+2τzixi)]=erf(θixiτ),subscript~𝑠𝑖subscript𝔼𝑝subscript𝑧𝑖delimited-[]sgnsubscript𝜃𝑖2𝜏subscript𝑧𝑖subscript𝑥𝑖erfsubscript𝜃𝑖subscript𝑥𝑖𝜏\displaystyle\tilde{s}_{i}=\mathbb{E}_{p(z_{i})}[\mathrm{sgn}(\theta_{i}+\sqrt% {2\tau}z_{i}-x_{i})]=\mathrm{erf}(\frac{\theta_{i}-x_{i}}{\sqrt{\tau}}),over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT italic_p ( italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ roman_sgn ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + square-root start_ARG 2 italic_τ end_ARG italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] = roman_erf ( divide start_ARG italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_τ end_ARG end_ARG ) , (S17)

where erf()erf\mathrm{erf}(\cdot)roman_erf ( ⋅ ) is the error function. Therefore, we have

u(τ,𝜽)=𝔼p(𝐱)[f(erf(𝜽𝐱τ))],𝑢𝜏𝜽subscript𝔼𝑝𝐱delimited-[]𝑓erf𝜽𝐱𝜏\displaystyle u(\tau,\bm{\theta})=\mathbb{E}_{p(\mathbf{x})}[f(\mathrm{erf}(% \frac{\bm{\theta}-\mathbf{x}}{\sqrt{\tau}}))],italic_u ( italic_τ , bold_italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_f ( roman_erf ( divide start_ARG bold_italic_θ - bold_x end_ARG start_ARG square-root start_ARG italic_τ end_ARG end_ARG ) ) ] , (S18)

where erf()erf\mathrm{erf}(\cdot)roman_erf ( ⋅ ) is the element-wise error function.

1.3 Proof of Thm. 3

Proof.

Define the square loss of 𝜽𝜽\bm{\theta}bold_italic_θ as

e(𝜽)=(h(𝜽)h(𝜽))2.𝑒𝜽superscript𝜽superscript𝜽2\displaystyle e(\bm{\theta})=(h(\bm{\theta})-h(\bm{\theta}^{\ast}))^{2}.italic_e ( bold_italic_θ ) = ( italic_h ( bold_italic_θ ) - italic_h ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (S19)

According to the definition of h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ), we have

e(𝜽)=Ep(𝐱)[f(sgn(𝜽𝐱))f(sgn(𝜽𝐱))]2Ep(𝐱)[(f(sgn(𝜽𝐱))f(sgn(𝜽𝐱)))2].𝑒𝜽subscriptE𝑝𝐱superscriptdelimited-[]𝑓sgn𝜽𝐱𝑓sgnsuperscript𝜽𝐱2subscriptE𝑝𝐱delimited-[]superscript𝑓sgn𝜽𝐱𝑓sgnsuperscript𝜽𝐱2\displaystyle e(\bm{\theta})=\mathrm{E}_{p(\mathbf{x})}[f(\mathrm{sgn(\bm{% \theta}-\mathbf{x})})-f(\mathrm{sgn(\bm{\theta}^{\ast}-\mathbf{x})})]^{2}\leq% \mathrm{E}_{p(\mathbf{x})}[(f(\mathrm{sgn(\bm{\theta}-\mathbf{x})})-f(\mathrm{% sgn(\bm{\theta}^{\ast}-\mathbf{x})}))^{2}].italic_e ( bold_italic_θ ) = roman_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) - italic_f ( roman_sgn ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - bold_x ) ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ roman_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ ( italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) - italic_f ( roman_sgn ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - bold_x ) ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] . (S20)

Define the error function

r(τ,𝐱;𝜽)=u(τ,𝐱;𝜽)u(τ,𝐱;𝜽).𝑟𝜏𝐱𝜽𝑢𝜏𝐱𝜽𝑢𝜏𝐱superscript𝜽\displaystyle r(\tau,\mathbf{x};\bm{\theta})=u(\tau,\mathbf{x};\bm{\theta})-u(% \tau,\mathbf{x};\bm{\theta}^{\ast}).italic_r ( italic_τ , bold_x ; bold_italic_θ ) = italic_u ( italic_τ , bold_x ; bold_italic_θ ) - italic_u ( italic_τ , bold_x ; bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) . (S21)

Then the error function satisfies the following heat equation

{τr(τ,𝐱;𝜽)=𝐱r(τ,𝐱;𝜽)r(0,𝐱;𝜽)=f(sgn(𝜽𝐱))f(sgn(𝜽𝐱)).\displaystyle\left\{\begin{matrix}\partial_{\tau}r(\tau,\mathbf{x};\bm{\theta}% )&=&\bigtriangledown_{\mathbf{x}}r(\tau,\mathbf{x};\bm{\theta})\\ r(0,\mathbf{x};\bm{\theta})&=&f(\mathrm{sgn}(\bm{\theta}-\mathbf{x}))-f(% \mathrm{sgn}(\bm{\theta}^{\ast}-\mathbf{x}))\end{matrix}\right..{ start_ARG start_ROW start_CELL ∂ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_r ( italic_τ , bold_x ; bold_italic_θ ) end_CELL start_CELL = end_CELL start_CELL ▽ start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT italic_r ( italic_τ , bold_x ; bold_italic_θ ) end_CELL end_ROW start_ROW start_CELL italic_r ( 0 , bold_x ; bold_italic_θ ) end_CELL start_CELL = end_CELL start_CELL italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) - italic_f ( roman_sgn ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - bold_x ) ) end_CELL end_ROW end_ARG . (S22)

Define the energy function of the error function r(τ,𝐱;𝜽)𝑟𝜏𝐱𝜽r(\tau,\mathbf{x};\bm{\theta})italic_r ( italic_τ , bold_x ; bold_italic_θ ) as

E(τ;𝜽)=nr2(τ,𝐱;𝜽)p(𝐱)𝑑𝐱.𝐸𝜏𝜽subscriptsuperscript𝑛superscript𝑟2𝜏𝐱𝜽𝑝𝐱differential-d𝐱\displaystyle E(\tau;\bm{\theta})=\int_{\mathbb{R}^{n}}r^{2}(\tau,\mathbf{x};% \bm{\theta})p(\mathbf{x})d\mathbf{x}.italic_E ( italic_τ ; bold_italic_θ ) = ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_r start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_τ , bold_x ; bold_italic_θ ) italic_p ( bold_x ) italic_d bold_x . (S23)

Then applying the heat equation and the integration by parts, we have

ddτE(τ;𝜽)=2nr(τ,𝐱;𝜽)2p(𝐱)𝑑𝐱.𝑑𝑑𝜏𝐸𝜏𝜽2subscriptsuperscript𝑛superscriptnorm𝑟𝜏𝐱𝜽2𝑝𝐱differential-d𝐱\displaystyle\frac{d}{d\tau}E(\tau;\bm{\theta})=-2\int_{\mathbb{R}^{n}}\left\|% \bigtriangledown r(\tau,\mathbf{x};\bm{\theta})\right\|^{2}p(\mathbf{x})d% \mathbf{x}.divide start_ARG italic_d end_ARG start_ARG italic_d italic_τ end_ARG italic_E ( italic_τ ; bold_italic_θ ) = - 2 ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ∥ ▽ italic_r ( italic_τ , bold_x ; bold_italic_θ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p ( bold_x ) italic_d bold_x . (S24)

Hence we have for 0<τ1<τ20subscript𝜏1subscript𝜏20<\tau_{1}<\tau_{2}0 < italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT

E(τ1;𝜽)=E(τ2;𝜽)+2τ1τ2nr(τ,𝐱;𝜽)2p(𝐱)𝑑𝐱𝑑τ.𝐸subscript𝜏1𝜽𝐸subscript𝜏2𝜽2superscriptsubscriptsubscript𝜏1subscript𝜏2subscriptsuperscript𝑛superscriptnorm𝑟𝜏𝐱𝜽2𝑝𝐱differential-d𝐱differential-d𝜏\displaystyle E(\tau_{1};\bm{\theta})=E(\tau_{2};\bm{\theta})+2\int_{\tau_{1}}% ^{\tau_{2}}\int_{\mathbb{R}^{n}}\left\|\bigtriangledown r(\tau,\mathbf{x};\bm{% \theta})\right\|^{2}p(\mathbf{x})d\mathbf{x}d\tau.italic_E ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_italic_θ ) = italic_E ( italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ; bold_italic_θ ) + 2 ∫ start_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ∥ ▽ italic_r ( italic_τ , bold_x ; bold_italic_θ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p ( bold_x ) italic_d bold_x italic_d italic_τ . (S25)

Use the Harnack’s inequality [22], we have

r(τ,𝐱;𝜽)2r(τ,𝐱;𝜽)τr(τ,𝐱;𝜽)+n2τr2(τ,𝐱;𝜽),superscriptnorm𝑟𝜏𝐱𝜽2𝑟𝜏𝐱𝜽subscript𝜏𝑟𝜏𝐱𝜽𝑛2𝜏superscript𝑟2𝜏𝐱𝜽\displaystyle\left\|\bigtriangledown r(\tau,\mathbf{x};\bm{\theta})\right\|^{2% }\leq r(\tau,\mathbf{x};\bm{\theta})\partial_{\tau}r(\tau,\mathbf{x};\bm{% \theta})+\frac{n}{2\tau}r^{2}(\tau,\mathbf{x};\bm{\theta}),∥ ▽ italic_r ( italic_τ , bold_x ; bold_italic_θ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_r ( italic_τ , bold_x ; bold_italic_θ ) ∂ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_r ( italic_τ , bold_x ; bold_italic_θ ) + divide start_ARG italic_n end_ARG start_ARG 2 italic_τ end_ARG italic_r start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_τ , bold_x ; bold_italic_θ ) , (S26)

combine with Eq. (S25), we have

E(τ1;𝜽)E(τ2;𝜽)+n2τ1τ2E(τ;𝜽)τ𝑑τ.𝐸subscript𝜏1𝜽𝐸subscript𝜏2𝜽𝑛2superscriptsubscriptsubscript𝜏1subscript𝜏2𝐸𝜏𝜽𝜏differential-d𝜏\displaystyle E(\tau_{1};\bm{\theta})\leq E(\tau_{2};\bm{\theta})+\frac{n}{2}% \int_{\tau_{1}}^{\tau_{2}}\frac{E(\tau;\bm{\theta})}{\tau}d\tau.italic_E ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_italic_θ ) ≤ italic_E ( italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ; bold_italic_θ ) + divide start_ARG italic_n end_ARG start_ARG 2 end_ARG ∫ start_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT divide start_ARG italic_E ( italic_τ ; bold_italic_θ ) end_ARG start_ARG italic_τ end_ARG italic_d italic_τ . (S27)

Using the Minkowski inequality on the measure p(𝐱)𝑝𝐱p(\mathbf{x})italic_p ( bold_x ), we have

h(𝜽)h(𝜽)=e1/2(𝜽)(n(f(sgn(𝜽𝐱))u(τ1;𝐱;𝜽))2p(𝐱)𝑑𝐱)1/2+(n(u(τ1;𝐱;𝜽)u(τ1;𝐱;𝜽))2p(𝐱)𝑑𝐱)1/2+(n(f(sgn(𝜽𝐱))u(τ1;𝐱;𝜽))2𝑑𝐱)1/2=(n(f(sgn(𝜽𝐱))u(τ1;𝐱;𝜽))2p(𝐱)𝑑𝐱)1/2+(n(f(sgn(𝜽𝐱))u(τ1;𝐱;𝜽))2𝑑𝐱)1/2+E1/2(τ1;𝜽).𝜽superscript𝜽superscript𝑒12𝜽absentsuperscriptsubscriptsuperscript𝑛superscript𝑓sgn𝜽𝐱𝑢subscript𝜏1𝐱𝜽2𝑝𝐱differential-d𝐱12missing-subexpressionsuperscriptsubscriptsuperscript𝑛superscript𝑢subscript𝜏1𝐱𝜽𝑢subscript𝜏1𝐱superscript𝜽2𝑝𝐱differential-d𝐱12missing-subexpressionsuperscriptsubscriptsuperscript𝑛superscript𝑓sgnsuperscript𝜽𝐱𝑢subscript𝜏1𝐱superscript𝜽2differential-d𝐱12missing-subexpressionabsentsuperscriptsubscriptsuperscript𝑛superscript𝑓sgn𝜽𝐱𝑢subscript𝜏1𝐱𝜽2𝑝𝐱differential-d𝐱12missing-subexpressionsuperscriptsubscriptsuperscript𝑛superscript𝑓sgnsuperscript𝜽𝐱𝑢subscript𝜏1𝐱superscript𝜽2differential-d𝐱12superscript𝐸12subscript𝜏1𝜽\displaystyle\begin{aligned} h(\bm{\theta})-h(\bm{\theta}^{\ast})=e^{1/2}(\bm{% \theta})&\leq\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}-\mathbf{x% }))-u(\tau_{1};\mathbf{x};\bm{\theta}))^{2}p(\mathbf{x})d\mathbf{x}\big{)}^{1/% 2}\\ &+\big{(}\int_{\mathbb{R}^{n}}(u(\tau_{1};\mathbf{x};\bm{\theta})-u(\tau_{1};% \mathbf{x};\bm{\theta}^{\ast}))^{2}p(\mathbf{x})d\mathbf{x}\big{)}^{1/2}\\ &+\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}^{\ast}-\mathbf{x}))-% u(\tau_{1};\mathbf{x};\bm{\theta}^{\ast}))^{2}d\mathbf{x}\big{)}^{1/2}\\ &=\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}-\mathbf{x}))-u(\tau_% {1};\mathbf{x};\bm{\theta}))^{2}p(\mathbf{x})d\mathbf{x}\big{)}^{1/2}\\ &+\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}^{\ast}-\mathbf{x}))-% u(\tau_{1};\mathbf{x};\bm{\theta}^{\ast}))^{2}d\mathbf{x}\big{)}^{1/2}+E^{1/2}% (\tau_{1};\bm{\theta}).\end{aligned}start_ROW start_CELL italic_h ( bold_italic_θ ) - italic_h ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) = italic_e start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT ( bold_italic_θ ) end_CELL start_CELL ≤ ( ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) - italic_u ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_x ; bold_italic_θ ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p ( bold_x ) italic_d bold_x ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ( ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_u ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_x ; bold_italic_θ ) - italic_u ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_x ; bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p ( bold_x ) italic_d bold_x ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ( ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_f ( roman_sgn ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - bold_x ) ) - italic_u ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_x ; bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d bold_x ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = ( ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) - italic_u ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_x ; bold_italic_θ ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p ( bold_x ) italic_d bold_x ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ( ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_f ( roman_sgn ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - bold_x ) ) - italic_u ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_x ; bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d bold_x ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT + italic_E start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_italic_θ ) . end_CELL end_ROW (S28)

Recall the continuity of the heat equation:

limτ0n(u(τ,𝐱;𝜽)f(sgn(𝜽𝐱)))2p(𝐱)𝑑𝐱=0.subscript𝜏0subscriptsuperscript𝑛superscript𝑢𝜏𝐱𝜽𝑓sgn𝜽𝐱2𝑝𝐱differential-d𝐱0\displaystyle\lim_{\tau\rightarrow 0}\int_{\mathbb{R}^{n}}(u(\tau,\mathbf{x};% \bm{\theta})-f(\mathrm{sgn}(\bm{\theta}-\mathbf{x})))^{2}p(\mathbf{x})d\mathbf% {x}=0.roman_lim start_POSTSUBSCRIPT italic_τ → 0 end_POSTSUBSCRIPT ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_u ( italic_τ , bold_x ; bold_italic_θ ) - italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p ( bold_x ) italic_d bold_x = 0 . (S29)

Therefore, given ϵ>0italic-ϵ0\epsilon>0italic_ϵ > 0, there exists a τ1>0subscript𝜏10\tau_{1}>0italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT > 0, such that

(n(f(sgn(𝜽𝐱))u(τ1;𝐱;𝜽))2p(𝐱)𝑑𝐱)1/2+(n(f(sgn(𝜽𝐱))u(τ1;𝐱;𝜽))2p(𝐱)𝑑𝐱)1/2<ϵ.missing-subexpressionsuperscriptsubscriptsuperscript𝑛superscript𝑓sgn𝜽𝐱𝑢subscript𝜏1𝐱𝜽2𝑝𝐱differential-d𝐱12missing-subexpressionsuperscriptsubscriptsuperscript𝑛superscript𝑓sgnsuperscript𝜽𝐱𝑢subscript𝜏1𝐱superscript𝜽2𝑝𝐱differential-d𝐱12italic-ϵ\displaystyle\begin{aligned} &\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{% \theta}-\mathbf{x}))-u(\tau_{1};\mathbf{x};\bm{\theta}))^{2}p(\mathbf{x})d% \mathbf{x}\big{)}^{1/2}\\ &+\big{(}\int_{\mathbb{R}^{n}}(f(\mathrm{sgn}(\bm{\theta}^{\ast}-\mathbf{x}))-% u(\tau_{1};\mathbf{x};\bm{\theta}^{\ast}))^{2}p(\mathbf{x})d\mathbf{x}\big{)}^% {1/2}<\epsilon.\end{aligned}start_ROW start_CELL end_CELL start_CELL ( ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_f ( roman_sgn ( bold_italic_θ - bold_x ) ) - italic_u ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_x ; bold_italic_θ ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p ( bold_x ) italic_d bold_x ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ( ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_f ( roman_sgn ( bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - bold_x ) ) - italic_u ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_x ; bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p ( bold_x ) italic_d bold_x ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT < italic_ϵ . end_CELL end_ROW (S30)

Recall Eq. (S27), we then have the error control for e(𝜽)𝑒𝜽e(\bm{\theta})italic_e ( bold_italic_θ ):

e1/2(𝜽)E1/2(τ1;𝜽)+ϵ(E(τ2;𝜽)+n2τ1τ2E(τ;𝜽)τ𝑑τ)1/2+ϵ.superscript𝑒12𝜽superscript𝐸12subscript𝜏1𝜽italic-ϵsuperscript𝐸subscript𝜏2𝜽𝑛2superscriptsubscriptsubscript𝜏1subscript𝜏2𝐸𝜏𝜽𝜏differential-d𝜏12italic-ϵ\displaystyle e^{1/2}(\bm{\theta})\leq E^{1/2}(\tau_{1};\bm{\theta})+\epsilon% \leq\big{(}E(\tau_{2};\bm{\theta})+\frac{n}{2}\int_{\tau_{1}}^{\tau_{2}}\frac{% E(\tau;\bm{\theta})}{\tau}d\tau\big{)}^{1/2}+\epsilon.italic_e start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT ( bold_italic_θ ) ≤ italic_E start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_italic_θ ) + italic_ϵ ≤ ( italic_E ( italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ; bold_italic_θ ) + divide start_ARG italic_n end_ARG start_ARG 2 end_ARG ∫ start_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT divide start_ARG italic_E ( italic_τ ; bold_italic_θ ) end_ARG start_ARG italic_τ end_ARG italic_d italic_τ ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT + italic_ϵ . (S31)

Noticed that

E(τ;𝜽)(f˘f)𝔼p(𝐱)[u(τ,𝐱;𝜽)u(τ,𝐱;𝜽)]=(f˘f)(u(τ,𝜽)u(τ,𝜽)),𝐸𝜏𝜽˘𝑓superscript𝑓subscript𝔼𝑝𝐱delimited-[]𝑢𝜏𝐱superscript𝜽𝑢𝜏𝐱𝜽˘𝑓superscript𝑓𝑢𝜏superscript𝜽𝑢𝜏𝜽\displaystyle E(\tau;\bm{\theta})\leq(\breve{f}-f^{\ast})\mathbb{E}_{p(\mathbf% {x})}[u(\tau,\mathbf{x};\bm{\theta}^{\ast})-u(\tau,\mathbf{x};\bm{\theta})]=(% \breve{f}-f^{\ast})(u(\tau,\bm{\theta}^{\ast})-u(\tau,\bm{\theta})),italic_E ( italic_τ ; bold_italic_θ ) ≤ ( over˘ start_ARG italic_f end_ARG - italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUBSCRIPT italic_p ( bold_x ) end_POSTSUBSCRIPT [ italic_u ( italic_τ , bold_x ; bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_u ( italic_τ , bold_x ; bold_italic_θ ) ] = ( over˘ start_ARG italic_f end_ARG - italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ( italic_u ( italic_τ , bold_italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_u ( italic_τ , bold_italic_θ ) ) , (S32)

where f˘=max𝐬f(𝐬)˘𝑓subscript𝐬𝑓𝐬\breve{f}=\max_{\mathbf{s}}f(\mathbf{s})over˘ start_ARG italic_f end_ARG = roman_max start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT italic_f ( bold_s ), and we prove the theorem. ∎

2 Complexity analysis

We empirically estimate the time cost per iteration of our HeO for problems of different dimension n𝑛nitalic_n in Fig. S1.

Refer to caption
Figure S1: The time cost per iteration (ms) of the HeO framework increases linearly with the dimensionality of the problem. We present the results averaged over five tests, with error bars representing three standard deviations.

3 Parameter sensitivity analysis

We report the performance of our HeO on a wide range of parameter settings in Fig. S2.

Refer to caption
Figure S2: Mean result target function values (lower is better) of our HeO on three datasets (Ising, gka, and rudy) under various step sizes (γ𝛾\gammaitalic_γ) and iteration counts (T𝑇Titalic_T). The star denotes the settings used in Figure 3 of the main paper.

4 Relation to evolution strategy

We show that the update in Eq. (8) is equivalent to a variation of the evolution strategy (ES) which is a robust and powerful algorithm for solving black-box optimization problem [51]. In fact, we have

𝜽u(τ,𝜽)=𝜽𝔼p(𝐳)[h(𝜽+2τ𝐳)]=𝜽14πτnexp(14τ𝐳𝜽2)h(𝐳)𝑑𝐳=12τ14πτnexp(14τ𝐳𝜽2)h(𝐳)(𝐳𝜽)𝑑𝐳=12τ12πnexp(12𝐳2)h(𝜽+2τ𝐳)𝐳𝑑𝐳=12τ𝔼p(𝐳)[h(𝜽+2τ𝐳)𝐳].subscript𝜽𝑢𝜏𝜽absentsubscript𝜽subscript𝔼𝑝𝐳delimited-[]𝜽2𝜏𝐳missing-subexpressionabsentsubscript𝜽1superscript4𝜋𝜏𝑛14𝜏superscriptnorm𝐳𝜽2𝐳differential-d𝐳missing-subexpressionabsent12𝜏1superscript4𝜋𝜏𝑛14𝜏superscriptnorm𝐳𝜽2𝐳𝐳𝜽differential-d𝐳missing-subexpressionabsent12𝜏1superscript2𝜋𝑛12superscriptnorm𝐳2𝜽2𝜏𝐳𝐳differential-d𝐳missing-subexpressionabsent12𝜏subscript𝔼𝑝𝐳delimited-[]𝜽2𝜏𝐳𝐳\displaystyle\begin{aligned} \bigtriangledown_{\bm{\theta}}u(\tau,\bm{\theta})% &=\bigtriangledown_{\bm{\theta}}\mathbb{E}_{p(\mathbf{z})}[h(\bm{\theta}+\sqrt% {2\tau}\mathbf{z})]\\ &=\bigtriangledown_{\bm{\theta}}\int\frac{1}{\sqrt{4\pi\tau}^{n}}\exp(-\frac{1% }{4\tau}\left\|\mathbf{z}-\bm{\theta}\right\|^{2})h(\mathbf{z})d\mathbf{z}\\ &=\frac{1}{2\tau}\int\frac{1}{\sqrt{4\pi\tau}^{n}}\exp(-\frac{1}{4\tau}\left\|% \mathbf{z}-\bm{\theta}\right\|^{2})h(\mathbf{z})(\mathbf{z}-\bm{\theta})d% \mathbf{z}\\ &=\frac{1}{\sqrt{2\tau}}\int\frac{1}{\sqrt{2\pi}^{n}}\exp(-\frac{1}{2}\left\|% \mathbf{z}\right\|^{2})h(\bm{\theta}+\sqrt{2\tau}\mathbf{z})\mathbf{z}d\mathbf% {z}\\ &=\frac{1}{\sqrt{2\tau}}\mathbb{E}_{p(\mathbf{z})}[h(\bm{\theta}+\sqrt{2\tau}% \mathbf{z})\mathbf{z}].\end{aligned}start_ROW start_CELL ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_u ( italic_τ , bold_italic_θ ) end_CELL start_CELL = ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_h ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z ) ] end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = ▽ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ∫ divide start_ARG 1 end_ARG start_ARG square-root start_ARG 4 italic_π italic_τ end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_ARG roman_exp ( - divide start_ARG 1 end_ARG start_ARG 4 italic_τ end_ARG ∥ bold_z - bold_italic_θ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_h ( bold_z ) italic_d bold_z end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = divide start_ARG 1 end_ARG start_ARG 2 italic_τ end_ARG ∫ divide start_ARG 1 end_ARG start_ARG square-root start_ARG 4 italic_π italic_τ end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_ARG roman_exp ( - divide start_ARG 1 end_ARG start_ARG 4 italic_τ end_ARG ∥ bold_z - bold_italic_θ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_h ( bold_z ) ( bold_z - bold_italic_θ ) italic_d bold_z end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_τ end_ARG end_ARG ∫ divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_ARG roman_exp ( - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ bold_z ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_h ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z ) bold_z italic_d bold_z end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_τ end_ARG end_ARG blackboard_E start_POSTSUBSCRIPT italic_p ( bold_z ) end_POSTSUBSCRIPT [ italic_h ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z ) bold_z ] . end_CELL end_ROW (S33)

The random vector 𝐳𝐳\mathbf{z}bold_z corresponds to the stochastic mutation and h(𝜽+2τ𝐳)𝜽2𝜏𝐳h(\bm{\theta}+\sqrt{2\tau}\mathbf{z})italic_h ( bold_italic_θ + square-root start_ARG 2 italic_τ end_ARG bold_z ) corresponds to the fitness in Alg. 1 in [63]. In ES, the standard deviation 2τ2𝜏\sqrt{2\tau}square-root start_ARG 2 italic_τ end_ARG is fixed in general, while in our HeO 2τt2subscript𝜏𝑡\sqrt{2\tau_{t}}square-root start_ARG 2 italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG is varying across time. As shown in Thm. 3, the varying τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in our HeO offers a theoretical benefit that it controls the upper bound of the optimization result. In contrast, a constant τ𝜏\tauitalic_τ in ES does not provide this benefit. Another difference between our HeO and ES is that we integral out 𝐳𝐳\mathbf{z}bold_z and using Eq. (11) to estimate the gradient, while in ES the gradient is estimated based on sampling 𝐳𝐳\mathbf{z}bold_z.

5 Relation to denoising diffusion models

Our approach, while bearing similarities to the DDM—a highly regarded and extensively utilized artificial generative model [24] that relies on the reverse diffusion process for data generation—differs in key aspects. The DDM necessitates reversing the diffusion process to generate data that from the target distribution. In contrast, it is unnecessary for our HeO to strictly adhering to the reverse time sequence τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in the optimization process, as under different τ𝜏\tauitalic_τ, the function u(τ,𝜽)𝑢𝜏𝜽u(\tau,\bm{\theta})italic_u ( italic_τ , bold_italic_θ ) shares the same optima with that of the original problem h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ). This claim is corroborated in Fig. S3. as shown below, where HeO applying non-monotonic schedules of τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT still demonstrates superior performance. Hence, it is possible to explore diverse τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT schedules to further performance enhancement.

6 Cooperative optimization

Refer to caption
Figure S3: Verifying the cooperative optimization mechanism of HeO. The best cut value over 10 runs for each algorithm on the K-2000 problem [14] when the control parameters are randomly perturbed by different random perturbation level δ𝛿\deltaitalic_δ. The red dash line is the best cut value ever find.

Our study suggests that HeO exhibits a distinct cooperative optimization mechanism, setting it apart from current methodologies. Specifically, HeO benefits from the fact that target functions u(τ,𝜽)𝑢𝜏𝜽u(\tau,\bm{\theta})italic_u ( italic_τ , bold_italic_θ ) share the same optima as the original problem h(𝜽)𝜽h(\bm{\theta})italic_h ( bold_italic_θ ) for any τ>0𝜏0\tau>0italic_τ > 0. This characteristic allows the solver to transition between different τ𝜏\tauitalic_τ values during the optimization process, eliminating the necessity for a monotonic τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT schedule. In contrast, traditional methods such as those based on quantum adiabatic theory or bifurcation theory require a linear increase of a control parameter atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT from 0 to 1. This parameter is analogous to τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in the HeO framework.

To empirically verify the above claim, we introduce a random perturbation to the τ𝜏\tauitalic_τ schedule in Alg. 1, rendering it non-monotonic: τ~t=ct2τtsubscript~𝜏𝑡subscriptsuperscript𝑐2𝑡subscript𝜏𝑡\tilde{\tau}_{t}=c^{2}_{t}\tau_{t}over~ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_c start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, where ctsubscript𝑐𝑡c_{t}italic_c start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is uniformly distributed on [1δ,1+δ]1𝛿1𝛿[1-\delta,1+\delta][ 1 - italic_δ , 1 + italic_δ ] with δ𝛿\deltaitalic_δ controlling the amplitude of the perturbation. For other methods based on quantum adiabatic theory or bifurcation theory, we correspondingly introduce the perturbation as a~t=clamp(ctat,0,1)subscript~𝑎𝑡clampsubscript𝑐𝑡subscript𝑎𝑡01\tilde{a}_{t}=\mathrm{clamp}(c_{t}a_{t},0,1)over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = roman_clamp ( italic_c start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , 0 , 1 ). If an algorithm contains cooperative optimization mechanism, it still works well even when the control parameter is not monotonic, as optimizing the transformed problems under different control parameters cooperatively contributes to optimizing the original problem. As shown in  Fig. S3, the performance of other methods are all dramatically deteriorated. In contrast, HeO shows no substantial decline in performance, corroborating that HeO employs a cooperative optimization mechanism.

7 Implementation details

All the experiments are conducted on a single NVIDIA RTX-3090 GPU (24GB) and an Intel(R) Xeon(R) Gold 6226R CPU (2.90GHz). For convenience, we denote τt=σtsubscript𝜏𝑡subscript𝜎𝑡\sqrt{\tau}_{t}=\sigma_{t}square-root start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT.

Monte Carlo gradient estimation for combinatorial optimization. Based on Eq. (5), we construct a combinatorial optimization algorithm using the Monte Carlo gradient estimation (MCGE), as shown in Alg. 2. Noticed that we clamp the parameter 𝜽tsubscript𝜽𝑡\bm{\theta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in the [0,1]01[0,1][ 0 , 1 ] for numerical stability; additionally, we binarize the 𝜽Tsubscript𝜽𝑇\bm{\theta}_{T}bold_italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT to obtain the optimized binary configuration 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT in the end.

Algorithm 2 Monte Carlo gradient estimation for combinatorial optimization (MCGE)
Input: target function f()𝑓f(\cdot)italic_f ( ⋅ ), step size γ𝛾\gammaitalic_γ, sample number M𝑀Mitalic_M, iteration number T𝑇Titalic_T
initialize elements of 𝜽0subscript𝜽0\bm{\theta}_{0}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 0.50.50.50.5
for t=0𝑡0t=0italic_t = 0 to T1𝑇1T-1italic_T - 1 do
     sample 𝐬(m)i.i.d.p(𝐬|𝜽t),m=1,,Mformulae-sequencesubscriptsimilar-toformulae-sequence𝑖𝑖𝑑superscript𝐬𝑚𝑝conditional𝐬subscript𝜽𝑡𝑚1𝑀\mathbf{s}^{(m)}\sim_{i.i.d.}p(\mathbf{s}|\bm{\theta}_{t}),\quad m=1,\ldots,Mbold_s start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∼ start_POSTSUBSCRIPT italic_i . italic_i . italic_d . end_POSTSUBSCRIPT italic_p ( bold_s | bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , italic_m = 1 , … , italic_M
     𝐠t1Mm=1Mf(𝐬(m))𝜽tlogp(𝐬(m)|𝜽t)subscript𝐠𝑡subscriptsubscript𝜽𝑡1𝑀superscriptsubscript𝑚1𝑀𝑓superscript𝐬𝑚𝑝conditionalsuperscript𝐬𝑚subscript𝜽𝑡\mathbf{g}_{t}\leftarrow\frac{1}{M}\sum_{m=1}^{M}f(\mathbf{s}^{(m)})% \bigtriangledown_{\bm{\theta}_{t}}\log p(\mathbf{s}^{(m)}|\bm{\theta}_{t})bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_f ( bold_s start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) ▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( bold_s start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT | bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
     𝜽t+1clamp(𝜽tγ𝐠t,0,1)subscript𝜽𝑡1clampsubscript𝜽𝑡𝛾subscript𝐠𝑡01\bm{\theta}_{t+1}\leftarrow\mathrm{clamp}\big{(}\bm{\theta}_{t}-\gamma\mathbf{% g}_{t},0,1\big{)}bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← roman_clamp ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , 0 , 1 )
end for
𝐬Tsgn(𝜽T0.5)subscript𝐬𝑇sgnsubscript𝜽𝑇0.5\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ← roman_sgn ( bold_italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT - 0.5 )
Output: binary configuration 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT

Gradient descent with momentum. We provide HeO with momentum in Alg. 3.

Algorithm 3 Heat diffusion optimization (HeO) with momentum
Input: target function f()𝑓f(\cdot)italic_f ( ⋅ ), step size γ𝛾\gammaitalic_γ, momentum κ𝜅\kappaitalic_κ, σ𝜎\sigmaitalic_σ schedule {σt}subscript𝜎𝑡\{\sigma_{t}\}{ italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, iteration number T𝑇Titalic_T, set 𝐠1=𝟎subscript𝐠10\mathbf{g}_{-1}=\mathbf{0}bold_g start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT = bold_0
initialize elements of 𝜽0subscript𝜽0\bm{\theta}_{0}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 0.50.50.50.5
for t=0𝑡0t=0italic_t = 0 to T1𝑇1T-1italic_T - 1 do
     sample 𝐱tUnif[0,1]nsimilar-tosubscript𝐱𝑡Unifsuperscript01𝑛\mathbf{x}_{t}\sim\mathrm{Unif}[0,1]^{n}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ roman_Unif [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT
     𝐰t𝜽tf(erf(𝜽t𝐱tσt))subscript𝐰𝑡subscriptsubscript𝜽𝑡𝑓erfsubscript𝜽𝑡subscript𝐱𝑡subscript𝜎𝑡\mathbf{w}_{t}\leftarrow\bigtriangledown_{\bm{\theta}_{t}}f(\mathrm{erf}(\frac% {\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}))bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) )
     𝐠tκ𝐠t1+γ𝐰tsubscript𝐠𝑡𝜅subscript𝐠𝑡1𝛾subscript𝐰𝑡\mathbf{g}_{t}\leftarrow\kappa\mathbf{g}_{t-1}+\gamma\mathbf{w}_{t}bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_κ bold_g start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + italic_γ bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     𝜽t+1Proj(𝜽t+𝐠t)subscript𝜽𝑡1subscriptProjsubscript𝜽𝑡subscript𝐠𝑡\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}+% \mathbf{g}_{t}\big{)}bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
end for
𝐬Tsgn(𝜽T0.5)subscript𝐬𝑇sgnsubscript𝜽𝑇0.5\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ← roman_sgn ( bold_italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT - 0.5 )
Output: binary configuration 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT

Toy example. We set the momentum κ=0.9999𝜅0.9999\kappa=0.9999italic_κ = 0.9999, learning rate γ=2𝛾2\gamma=2italic_γ = 2 and iteration number T=5000𝑇5000T=5000italic_T = 5000 for HeO (Alg. 3). For MCGE without momentum, we set T=50000𝑇50000T=50000italic_T = 50000, we set γ=𝛾absent\gamma=italic_γ =1e-6, momentum κ=0𝜅0\kappa=0italic_κ = 0, and M=10𝑀10M=10italic_M = 10. For MCGE with momentum, we set momentum as 0.90.90.90.9.

Max-cut problem. For solving the max-cut problems from the Biq Mac Library [36], we set the steps T=5000𝑇5000T=5000italic_T = 5000 for all the algorithms. For HeO, we set γ=2𝛾2\gamma=2italic_γ = 2 and σtsubscript𝜎𝑡\sigma_{t}italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT linearly decreases from 1111 to 00 for HeO, and we set momentum as zero. For LQA and SIM-CIM, we use the setting in [10]. For bSB, dSB, aSB, and CIM, we apply the settings in [64]. To reduce the fluctuations of the results, for each algorithm algalg\mathrm{alg}roman_alg and instant i𝑖iitalic_i, the relative loss is calculated as |Ci,algCmini|/|Cmini|superscript𝐶𝑖algsubscriptsuperscript𝐶𝑖superscriptsubscript𝐶𝑖|C^{i,\mathrm{alg}}-C^{i}_{\min}|/|C_{\min}^{i}|| italic_C start_POSTSUPERSCRIPT italic_i , roman_alg end_POSTSUPERSCRIPT - italic_C start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT | / | italic_C start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT |, where Ci,algsuperscript𝐶𝑖algC^{i,\mathrm{alg}}italic_C start_POSTSUPERSCRIPT italic_i , roman_alg end_POSTSUPERSCRIPT is the lowest output of the algorithm algalg\mathrm{alg}roman_alg on the instance i𝑖iitalic_i over 10 tries, and Cminisubscriptsuperscript𝐶𝑖C^{i}_{\min}italic_C start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT is the lowest output of all 7 the algorithm on the instance i𝑖iitalic_i. For each test, we estimate the mean and std from 10 runs.

3-SAT problem. For Boolean 3-satisfiability (3-SAT) problem, we set the momentum κ=0.9999𝜅0.9999\kappa=0.9999italic_κ = 0.9999, T=5000𝑇5000T=5000italic_T = 5000, γ=2𝛾2\gamma=2italic_γ = 2, and σtsubscript𝜎𝑡\sigma_{t}italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT linearly decreases from 22\sqrt{2}square-root start_ARG 2 end_ARG to 00 for HeO. According to the empirical finding that high-order loss function leads to better results [16], we include higher order terms in the target function. However, since

(1cisi2)k=1cisi2superscript1subscript𝑐𝑖subscript𝑠𝑖2𝑘1subscript𝑐𝑖subscript𝑠𝑖2\displaystyle(\frac{1-c_{i}s_{i}}{2})^{k}=\frac{1-c_{i}s_{i}}{2}( divide start_ARG 1 - italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT = divide start_ARG 1 - italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG (S34)

for any si,ci{1,1}subscript𝑠𝑖subscript𝑐𝑖11s_{i},c_{i}\in\{-1,1\}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { - 1 , 1 }, directly introducing higher order term in the f(𝐬)𝑓𝐬f(\mathbf{s})italic_f ( bold_s ) is useless. Instead, we adjust the gradient from

𝜽tf(erf(𝜽t𝐱tσt)))=𝜽t[h=1Hi=1312(1chi(erf(𝜽t𝐱tσt))hi)],\displaystyle\bigtriangledown_{\bm{\theta}_{t}}f(\mathrm{erf}(\frac{\bm{\theta% }_{t}-\mathbf{x}_{t}}{\sigma_{t}})))=\bigtriangledown_{\bm{\theta}_{t}}[\sum_{% h=1}^{H}\prod_{i=1}^{3}\frac{1}{2}(1-c_{h_{i}}\big{(}\mathrm{erf}(\frac{\bm{% \theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}})\big{)}_{h_{i}})],▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) ) ) = ▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( 1 - italic_c start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) ) start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ] , (S35)

to

𝜽t[h=1Hi=13(12(1chi(erf(𝜽t𝐱tσt))hi))4].subscriptsubscript𝜽𝑡delimited-[]superscriptsubscript1𝐻superscriptsubscriptproduct𝑖13superscript121subscript𝑐subscript𝑖subscripterfsubscript𝜽𝑡subscript𝐱𝑡subscript𝜎𝑡subscript𝑖4\displaystyle\bigtriangledown_{\bm{\theta}_{t}}[\sum_{h=1}^{H}\prod_{i=1}^{3}(% \frac{1}{2}(1-c_{h_{i}}\big{(}\mathrm{erf}(\frac{\bm{\theta}_{t}-\mathbf{x}_{t% }}{\sigma_{t}})\big{)}_{h_{i}}))^{4}].▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ( divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( 1 - italic_c start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) ) start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT ] . (S36)

We consider the 3-SAT problems with various number of variables from the SATLIB [37]. For each number of variables in the dataset, we consider the first 100 instance. We apply the same configuration of that in [16] for both 3-order solver and 2-order oscillation Ising machine solver. The energy gap of the 2-order solver is set as 1. For each test, we estimate the mean and std from 100 runs.

Algorithm 4 Heat diffusion optimization (HeO) for 3-SAT problem
Input: adjusted target function f(𝐬)=h=1Hi=13(1chishi2)4𝑓𝐬superscriptsubscript1𝐻superscriptsubscriptproduct𝑖13superscript1subscript𝑐subscript𝑖subscript𝑠subscript𝑖24f(\mathbf{s})=\sum_{h=1}^{H}\prod_{i=1}^{3}\big{(}\frac{1-c_{h_{i}}s_{h_{i}}}{% 2}\big{)}^{4}italic_f ( bold_s ) = ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ( divide start_ARG 1 - italic_c start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT (Eq. (S36)), step size γ𝛾\gammaitalic_γ, momentum κ𝜅\kappaitalic_κ, σ𝜎\sigmaitalic_σ schedule {σt}subscript𝜎𝑡\{\sigma_{t}\}{ italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, iteration number T𝑇Titalic_T
initialize elements of 𝜽0subscript𝜽0\bm{\theta}_{0}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 0.50.50.50.5, set 𝐠1=𝟎subscript𝐠10\mathbf{g}_{-1}=\mathbf{0}bold_g start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT = bold_0
for t=0𝑡0t=0italic_t = 0 to T1𝑇1T-1italic_T - 1 do
     sample 𝐱tUnif[0,1]nsimilar-tosubscript𝐱𝑡Unifsuperscript01𝑛\mathbf{x}_{t}\sim\mathrm{Unif}[0,1]^{n}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ roman_Unif [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT
     𝐰t𝜽tf(erf(𝜽t𝐱tσt))subscript𝐰𝑡subscriptsubscript𝜽𝑡𝑓erfsubscript𝜽𝑡subscript𝐱𝑡subscript𝜎𝑡\mathbf{w}_{t}\leftarrow\bigtriangledown_{\bm{\theta}_{t}}f(\mathrm{erf}(\frac% {\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}))bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) )
     𝐠tκ𝐠t1+γ𝐰tsubscript𝐠𝑡𝜅subscript𝐠𝑡1𝛾subscript𝐰𝑡\mathbf{g}_{t}\leftarrow\kappa\mathbf{g}_{t-1}+\gamma\mathbf{w}_{t}bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_κ bold_g start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + italic_γ bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     𝜽t+1Proj(𝜽tγ𝐠t)subscript𝜽𝑡1subscriptProjsubscript𝜽𝑡𝛾subscript𝐠𝑡\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \gamma\mathbf{g}_{t}\big{)}bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
end for
𝐬Tsgn(𝜽T0.5)subscript𝐬𝑇sgnsubscript𝜽𝑇0.5\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ← roman_sgn ( bold_italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT - 0.5 )
Output: binary configuration 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT

Ternary-value neural network learning. We represent a ternary variable st{1,0,1}subscript𝑠t101s_{\mathrm{t}}\in\{-1,0,1\}italic_s start_POSTSUBSCRIPT roman_t end_POSTSUBSCRIPT ∈ { - 1 , 0 , 1 } as st=12(sb,1+sb,2)subscript𝑠t12subscript𝑠b1subscript𝑠b2s_{\mathrm{t}}=\frac{1}{2}(s_{\mathrm{b},1}+s_{\mathrm{b},2})italic_s start_POSTSUBSCRIPT roman_t end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_s start_POSTSUBSCRIPT roman_b , 1 end_POSTSUBSCRIPT + italic_s start_POSTSUBSCRIPT roman_b , 2 end_POSTSUBSCRIPT ) with two bits sb,1,sb,2{1,1}subscript𝑠b1subscript𝑠b211s_{\mathrm{b},1},s_{\mathrm{b},2}\in\{-1,1\}italic_s start_POSTSUBSCRIPT roman_b , 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT roman_b , 2 end_POSTSUBSCRIPT ∈ { - 1 , 1 }. In this way, each element of W𝑊Witalic_W can be represented as a function of 𝐬m×n×2𝐬superscript𝑚𝑛2\mathbf{s}\in\mathbb{R}^{m\times n\times 2}bold_s ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n × 2 end_POSTSUPERSCRIPT. We denote this relation as a matrix-value function W=W(𝐬)𝑊𝑊𝐬W=W(\mathbf{s})italic_W = italic_W ( bold_s )

Wij(𝐬)=12(sij,1+sij,2),i=1,,m,,j=1,,n,\displaystyle W_{ij}(\mathbf{s})=\frac{1}{2}(s_{ij,1}+s_{ij,2}),\quad i=1,% \ldots,m,\quad,j=1,\ldots,n,italic_W start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( bold_s ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_s start_POSTSUBSCRIPT italic_i italic_j , 1 end_POSTSUBSCRIPT + italic_s start_POSTSUBSCRIPT italic_i italic_j , 2 end_POSTSUBSCRIPT ) , italic_i = 1 , … , italic_m , , italic_j = 1 , … , italic_n , (S37)

Based on the above encoding procedure, we design the training algorithm for based on HeO in Alg. 5. The input 𝐯𝐯\mathbf{v}bold_v of the dataset 𝒟𝒟\mathcal{D}caligraphic_D is generated from the uniform distribution on {1,0,1}nsuperscript101𝑛\{-1,0,1\}^{n}{ - 1 , 0 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. For HeO, we set T=10000𝑇10000T=10000italic_T = 10000, γ=0.5𝛾0.5\gamma=0.5italic_γ = 0.5, κ=0.999𝜅0.999\kappa=0.999italic_κ = 0.999, and σtsubscript𝜎𝑡\sigma_{t}italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT linearly decreasing from 22\sqrt{2}square-root start_ARG 2 end_ARG to 00. For MCGE, we set T=10000𝑇10000T=10000italic_T = 10000, γ=1e7𝛾1𝑒7\gamma=1e-7italic_γ = 1 italic_e - 7, M=10𝑀10M=10italic_M = 10, and κ=0.9999𝜅0.9999\kappa=0.9999italic_κ = 0.9999. We empirically find that MCGE need high sampling number (M𝑀Mitalic_M) and low learning rate (γ𝛾\gammaitalic_γ) for stability, while this is not the case for HeO. For each test, we estimate the mean and std from 10 runs.

Algorithm 5 Heat diffusion optimization (HeO) for training ternary-value neural network
Input: dataset 𝒟𝒟\mathcal{D}caligraphic_D, step size γ𝛾\gammaitalic_γ, momentum κ𝜅\kappaitalic_κ, σ𝜎\sigmaitalic_σ schedule {σt}subscript𝜎𝑡\{\sigma_{t}\}{ italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, iteration number T𝑇Titalic_T
initialize elements of 𝜽0subscript𝜽0\bm{\theta}_{0}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 1111, initialize elements of 𝜷~0subscript~𝜷0\tilde{\bm{\beta}}_{0}over~ start_ARG bold_italic_β end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 00, set 𝐠1=𝟎subscript𝐠10\mathbf{g}_{-1}=\mathbf{0}bold_g start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT = bold_0
for t=0𝑡0t=0italic_t = 0 to T1𝑇1T-1italic_T - 1 do
     sample 𝐱tUnif[0,1]nsimilar-tosubscript𝐱𝑡Unifsuperscript01𝑛\mathbf{x}_{t}\sim\mathrm{Unif}[0,1]^{n}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ roman_Unif [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT
     WtW(erf(𝜽t𝐱tσt))subscript𝑊𝑡𝑊erfsubscript𝜽𝑡subscript𝐱𝑡subscript𝜎𝑡W_{t}\leftarrow W(\mathrm{erf}(\frac{\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t% }}))italic_W start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_W ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) ) (Eq. (S37))
     MSE1|𝒟|(𝐯,𝐲)𝒟𝚪(𝐯;Wt)𝐲2MSE1𝒟subscript𝐯𝐲𝒟superscriptnorm𝚪𝐯subscript𝑊𝑡𝐲2\mathrm{MSE}\leftarrow\frac{1}{\left|\mathcal{D}\right|}\sum_{(\mathbf{v},% \mathbf{y})\in\mathcal{D}}\left\|\bm{\Gamma}(\mathbf{v};W_{t})-\mathbf{y}% \right\|^{2}roman_MSE ← divide start_ARG 1 end_ARG start_ARG | caligraphic_D | end_ARG ∑ start_POSTSUBSCRIPT ( bold_v , bold_y ) ∈ caligraphic_D end_POSTSUBSCRIPT ∥ bold_Γ ( bold_v ; italic_W start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - bold_y ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
     𝐰t𝜽tMSEsubscript𝐰𝑡subscriptsubscript𝜽𝑡MSE\mathbf{w}_{t}\leftarrow\bigtriangledown_{\bm{\theta}_{t}}\mathrm{MSE}bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_MSE
     𝐠tκ𝐠t1+γ𝐰tsubscript𝐠𝑡𝜅subscript𝐠𝑡1𝛾subscript𝐰𝑡\mathbf{g}_{t}\leftarrow\kappa\mathbf{g}_{t-1}+\gamma\mathbf{w}_{t}bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_κ bold_g start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + italic_γ bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     𝜽t+1Proj(𝜽t𝐠t)subscript𝜽𝑡1subscriptProjsubscript𝜽𝑡subscript𝐠𝑡\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \mathbf{g}_{t}\big{)}bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
end for
𝐬T=sgn(θT0.5)subscript𝐬𝑇sgnsubscript𝜃𝑇0.5\mathbf{s}_{T}=\mathrm{sgn}({\theta}_{T}-0.5)bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = roman_sgn ( italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT - 0.5 )
Output: WT=W(𝐬T)subscript𝑊𝑇𝑊subscript𝐬𝑇W_{T}=W(\mathbf{s}_{T})italic_W start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = italic_W ( bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT )

Variable selection problem. We construct an algorithm for variable selection problem based on HeO as shown in Alg. 6, where the function f(𝐬,𝜷)𝑓𝐬𝜷f(\mathbf{s},\bm{\beta})italic_f ( bold_s , bold_italic_β ) is defined in Eq. (17).

Algorithm 6 Heat diffusion optimization (HeO) for linear regression variable selection
Input: dataset 𝒟𝒟\mathcal{D}caligraphic_D, step size γ𝛾\gammaitalic_γ, momentum κ𝜅\kappaitalic_κ, σ𝜎\sigmaitalic_σ schedule {σt}subscript𝜎𝑡\{\sigma_{t}\}{ italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, iteration number T𝑇Titalic_T
initialize elements of 𝜽0subscript𝜽0\bm{\theta}_{0}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 1111, initialize elements of 𝜷~0subscript~𝜷0\tilde{\bm{\beta}}_{0}over~ start_ARG bold_italic_β end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 00, set 𝐠1β=𝐠1θ=𝟎subscriptsuperscript𝐠𝛽1subscriptsuperscript𝐠𝜃10\mathbf{g}^{\beta}_{-1}=\mathbf{g}^{\theta}_{-1}=\mathbf{0}bold_g start_POSTSUPERSCRIPT italic_β end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT = bold_g start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT = bold_0.
for t=0𝑡0t=0italic_t = 0 to T1𝑇1T-1italic_T - 1 do
     sample 𝐱tθUnif[0,1]nsimilar-tosubscriptsuperscript𝐱𝜃𝑡Unifsuperscript01𝑛\mathbf{x}^{\theta}_{t}\sim\mathrm{Unif}[0,1]^{n}bold_x start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ roman_Unif [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT
     𝐰tβ𝜷~tf(erf(𝜽t𝐱tσt),𝜷)subscriptsuperscript𝐰𝛽𝑡subscriptsubscript~𝜷𝑡𝑓erfsubscript𝜽𝑡subscript𝐱𝑡subscript𝜎𝑡𝜷\mathbf{w}^{\beta}_{t}\leftarrow\bigtriangledown_{\tilde{\bm{\beta}}_{t}}f(% \mathrm{erf}(\frac{\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}),\bm{\beta})bold_w start_POSTSUPERSCRIPT italic_β end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ▽ start_POSTSUBSCRIPT over~ start_ARG bold_italic_β end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) , bold_italic_β )
     𝐠tβκ𝐠t1β+γT𝐰tβsubscriptsuperscript𝐠𝛽𝑡𝜅subscriptsuperscript𝐠𝛽𝑡1𝛾𝑇subscriptsuperscript𝐰𝛽𝑡\mathbf{g}^{\beta}_{t}\leftarrow\kappa\mathbf{g}^{\beta}_{t-1}+\frac{\gamma}{T% }\mathbf{w}^{\beta}_{t}bold_g start_POSTSUPERSCRIPT italic_β end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_κ bold_g start_POSTSUPERSCRIPT italic_β end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + divide start_ARG italic_γ end_ARG start_ARG italic_T end_ARG bold_w start_POSTSUPERSCRIPT italic_β end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     𝜷~t+1𝜷~t𝐠tβsubscript~𝜷𝑡1subscript~𝜷𝑡subscriptsuperscript𝐠𝛽𝑡\tilde{\bm{\beta}}_{t+1}\leftarrow\tilde{\bm{\beta}}_{t}-\mathbf{g}^{\beta}_{t}over~ start_ARG bold_italic_β end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← over~ start_ARG bold_italic_β end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_g start_POSTSUPERSCRIPT italic_β end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     𝐰tθ𝜽tf(erf(𝜽t𝐱tσt),𝜷)subscriptsuperscript𝐰𝜃𝑡subscriptsubscript𝜽𝑡𝑓erfsubscript𝜽𝑡subscript𝐱𝑡subscript𝜎𝑡𝜷\mathbf{w}^{\theta}_{t}\leftarrow\bigtriangledown_{\bm{\theta}_{t}}f(\mathrm{% erf}(\frac{\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}}),\bm{\beta})bold_w start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( roman_erf ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ) , bold_italic_β )
     𝐠tθκ𝐠t1θ+γ𝐰tθsubscriptsuperscript𝐠𝜃𝑡𝜅subscriptsuperscript𝐠𝜃𝑡1𝛾subscriptsuperscript𝐰𝜃𝑡\mathbf{g}^{\theta}_{t}\leftarrow\kappa\mathbf{g}^{\theta}_{t-1}+\gamma\mathbf% {w}^{\theta}_{t}bold_g start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_κ bold_g start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + italic_γ bold_w start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     𝜽t+1Proj(𝜽tγ𝐠tθ)subscript𝜽𝑡1subscriptProjsubscript𝜽𝑡𝛾subscriptsuperscript𝐠𝜃𝑡\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \gamma\mathbf{g}^{\theta}_{t}\big{)}bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ bold_g start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
end for
𝐬Tsgn(𝜽T0.5)subscript𝐬𝑇sgnsubscript𝜽𝑇0.5\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ← roman_sgn ( bold_italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT - 0.5 )
Output: 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT

We randomly generate 400400400400-dimensional datasets with 1000100010001000 training samples. The input 𝐯𝐯\mathbf{v}bold_v is sampled from a standard Gaussian distribution. The element of the ground-truth coefficient 𝜷superscript𝜷\bm{\beta}^{\ast}bold_italic_β start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is uniformly distributed on [2,1][1,2]2112[-2,-1]\cup[1,2][ - 2 , - 1 ] ∪ [ 1 , 2 ], and each element haves 1q1𝑞1-q1 - italic_q probability of being set as zero and thus should be ignored for the prediction. We apply a five-fold cross-validation for all of methods. For our HeO, we set T=2000𝑇2000T=2000italic_T = 2000 and γ=1𝛾1\gamma=1italic_γ = 1, κ=0.999𝜅0.999\kappa=0.999italic_κ = 0.999. We generate an ensemble of indicators 𝐬𝐬\mathbf{s}bold_s of size 100. For each 𝐬𝐬\mathbf{s}bold_s in the ensemble, we fit a linear model by implementing an OLS on the non-zero variables indicated by 𝐬𝐬\mathbf{s}bold_s and calculate the average MSE loss of the linear model on the cross-validation sets. We then select the linear model with lowest MSE on the validate sets as the output linear model. For Lasso and L0.5subscript𝐿0.5L_{0.5}italic_L start_POSTSUBSCRIPT 0.5 end_POSTSUBSCRIPT regression, we follow the implementation in [41] with 10 iterations. the regularization parameter is selected by cross-validation from {0.05,0.1,0.2,0.5,1,2,5}0.050.10.20.5125\{0.05,0.1,0.2,0.5,1,2,5\}{ 0.05 , 0.1 , 0.2 , 0.5 , 1 , 2 , 5 }. For each test, we estimate the mean and std from 10 runs.

Table S2: The attributes of the real world graphs and the parameter settings of HeO.
graph name |V|𝑉|V|| italic_V | |E|𝐸|E|| italic_E | T𝑇Titalic_T γ𝛾\gammaitalic_γ λtsubscript𝜆𝑡\lambda_{t}italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT σtsubscript𝜎𝑡\sigma_{t}italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
tech-RL-caida 190914 607610 200 2.5 linearly from 0 to 2.5 linearly from 22\sqrt{2}square-root start_ARG 2 end_ARG to 0
soc-youtube 495957 1936748 200 2.5 linearly from 0 to 2.5
inf-roadNet-PA 1087562 1541514 200 2.5 linearly from 0 to 7.5
inf-roadNet-CA 1957027 2760388 200 5 linearly from 0 to 7.5
socfb-B-anon 2937612 20959854 50 2.5 linearly from 0 to 5
socfb-A-anon 3097165 23667394 50 2.5 linearly from 0 to 5
socfb-uci-uni 58790782 92208195 50 2.5 linearly from 0 to 5

Minimum vertex cover problem. For constrained binary optimization

min𝐬{1,1}nf(𝐬),subscript𝐬superscript11𝑛𝑓𝐬\displaystyle\min_{\mathbf{s}\in\{\-1,1\}^{n}}f(\mathbf{s}),roman_min start_POSTSUBSCRIPT bold_s ∈ { 1 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_f ( bold_s ) , (S38)
gk(𝐬)0,,k=1,,K,\displaystyle g_{k}(\mathbf{s})\leq 0,\quad,k=1,\ldots,K,italic_g start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_s ) ≤ 0 , , italic_k = 1 , … , italic_K , (S39)

we put the constrains as the penalty function with coefficient λ𝜆\lambdaitalic_λ into the target function

fλ(𝐬)=f(𝐬)λk=1Kgk(𝐬),subscript𝑓𝜆𝐬𝑓𝐬𝜆superscriptsubscript𝑘1𝐾subscript𝑔𝑘𝐬\displaystyle f_{\lambda}(\mathbf{s})=f(\mathbf{s})-\lambda\sum_{k=1}^{K}g_{k}% (\mathbf{s}),italic_f start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ( bold_s ) = italic_f ( bold_s ) - italic_λ ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_g start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_s ) , (S40)

and the corresponding algorithm is shown in Alg. 7.

Algorithm 7 Heat diffusion optimization (HeO) for constrained binary optimization
Input: target function with penalty fλsubscript𝑓𝜆f_{\lambda}italic_f start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT, step size γ𝛾\gammaitalic_γ, σ𝜎\sigmaitalic_σ schedule {σt}subscript𝜎𝑡\{\sigma_{t}\}{ italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, penalty coefficients schedule {λt}subscript𝜆𝑡\{{\lambda}_{t}\}{ italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, iteration number T𝑇Titalic_T
initialize elements of 𝜽0subscript𝜽0\bm{\theta}_{0}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as 0.50.50.50.5, set 𝐠1=𝟎subscript𝐠10\mathbf{g}_{-1}=\mathbf{0}bold_g start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT = bold_0.
for t=0𝑡0t=0italic_t = 0 to T1𝑇1T-1italic_T - 1 do
     sample 𝐱tsubscript𝐱𝑡\mathbf{x}_{t}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT from Unif[0,1]nUnifsuperscript01𝑛\mathrm{Unif}[0,1]^{n}roman_Unif [ 0 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT
     𝐰t𝜽tfλt(𝜽t𝐱tσt)subscript𝐰𝑡subscriptsubscript𝜽𝑡subscript𝑓subscript𝜆𝑡subscript𝜽𝑡subscript𝐱𝑡subscript𝜎𝑡\mathbf{w}_{t}\leftarrow\bigtriangledown_{{\bm{\theta}}_{t}}f_{\lambda_{t}}(% \frac{\bm{\theta}_{t}-\mathbf{x}_{t}}{\sigma_{t}})bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ▽ start_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( divide start_ARG bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG )
     𝐠tκ𝐠t1+γ𝐰tsubscript𝐠𝑡𝜅subscript𝐠𝑡1𝛾subscript𝐰𝑡\mathbf{g}_{t}\leftarrow\kappa\mathbf{g}_{t-1}+\gamma\mathbf{w}_{t}bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_κ bold_g start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + italic_γ bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     𝜽t+1Proj(𝜽tγ𝐠t)subscript𝜽𝑡1subscriptProjsubscript𝜽𝑡𝛾subscript𝐠𝑡\bm{\theta}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{I}}\big{(}\bm{\theta}_{t}-% \gamma\mathbf{g}_{t}\big{)}bold_italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← roman_Proj start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ bold_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
end for
𝐬Tsgn(𝜽T0.5)subscript𝐬𝑇sgnsubscript𝜽𝑇0.5\mathbf{s}_{T}\leftarrow\mathrm{sgn}(\bm{\theta}_{T}-0.5)bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ← roman_sgn ( bold_italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT - 0.5 )
Output: 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT
Algorithm 8 Refinement of the result of MVC
Input: the result of HeO 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT
for i=1𝑖1i=1italic_i = 1 to n𝑛nitalic_n do
     set sT,isubscript𝑠𝑇𝑖s_{T,i}italic_s start_POSTSUBSCRIPT italic_T , italic_i end_POSTSUBSCRIPT as 00 if 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT is still a vertex cover
end for
Output: 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT

We implement the HeO on a single NVIDIA RTX 3090 GPU for all the minimum vertex cover (MVC) experiments. Let 𝐬𝐬\mathbf{s}bold_s be the configuration to be optimized, in which sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is 1111 if we select i𝑖iitalic_i-th vertex into 𝒱csubscript𝒱𝑐\mathcal{V}_{c}caligraphic_V start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, otherwise we do not select i𝑖iitalic_i-vertex into 𝒱csubscript𝒱𝑐\mathcal{V}_{c}caligraphic_V start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. The target function to be minimize is the size of 𝒱csubscript𝒱𝑐\mathcal{V}_{c}caligraphic_V start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT: f(𝐬)=i=1nsi+12𝑓𝐬superscriptsubscript𝑖1𝑛subscript𝑠𝑖12f(\mathbf{s})=\sum_{i=1}^{n}\frac{s_{i}+1}{2}italic_f ( bold_s ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + 1 end_ARG start_ARG 2 end_ARG, and the constrains are

gij(𝐬)=(1si+12)(1sj+12)=0,i,j,eij,formulae-sequencesubscript𝑔𝑖𝑗𝐬1subscript𝑠𝑖121subscript𝑠𝑗120for-all𝑖𝑗subscript𝑒𝑖𝑗\displaystyle g_{ij}(\mathbf{s})=(1-\frac{s_{i}+1}{2})(1-\frac{s_{j}+1}{2})=0,% \quad\forall i,j,e_{ij}\in\mathcal{E},italic_g start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( bold_s ) = ( 1 - divide start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + 1 end_ARG start_ARG 2 end_ARG ) ( 1 - divide start_ARG italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + 1 end_ARG start_ARG 2 end_ARG ) = 0 , ∀ italic_i , italic_j , italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∈ caligraphic_E , (S41)

where eijsubscript𝑒𝑖𝑗e_{ij}italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT represent the edge connecting the i𝑖iitalic_i and j𝑗jitalic_j-th vertices. We construct the target function fλ(𝐬)=f(𝐬)+λeijgij(𝐬)subscript𝑓𝜆𝐬𝑓𝐬𝜆subscriptsubscript𝑒𝑖𝑗subscript𝑔𝑖𝑗𝐬f_{\lambda}(\mathbf{s})=f(\mathbf{s})+\lambda\sum_{e_{ij}\in\mathcal{E}}g_{ij}% (\mathbf{s})italic_f start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ( bold_s ) = italic_f ( bold_s ) + italic_λ ∑ start_POSTSUBSCRIPT italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∈ caligraphic_E end_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( bold_s ). The term with the positive factor λ𝜆\lambdaitalic_λ penalizes vector 𝐬𝐬\mathbf{s}bold_s when there are uncovered edges. After the HeO outputs the result 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, we empirically find that its subset may also form a vertex cover for the graph 𝒢𝒢\mathcal{G}caligraphic_G, so we implement the following refinement on the result 𝐬Tsubscript𝐬𝑇\mathbf{s}_{T}bold_s start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, as shown in Alg. 8. We report the vertex number, edge number and settings of HeO in Tab. S2. For FastVC, we follow the settings in [45] and use its codebase, and set the cut-off time as the same as the time cost of HeO. For each test, we estimate the mean and std from 10 runs.