This is the html version of the file https://arxiv.org/abs/2012.13349.
Google automatically generates html versions of documents as we crawl the web.
arXiv:2012.13349v3 [math.OC] 29 Jul 2021
Page 1
Solving Mixed Integer Programs Using Neural
Networks
Vinod Nair∗†1, Sergey Bartunov∗1, Felix Gimeno∗1, Ingrid von Glehn∗1, Pawel Lichocki∗2, Ivan
Lobov∗1, Brendan O’Donoghue∗1, Nicolas Sonnerat∗1, Christian Tjandraatmadja∗2, Pengming
Wang∗1, Ravichandra Addanki1, Tharindi Hapuarachchi1, Thomas Keck1, James Keeling1,
Pushmeet Kohli1, Ira Ktena1, Yujia Li1, Oriol Vinyals1, Yori Zwols1
1DeepMind, 2Google Forschung
Mixed Integer Programming (MIP) solvers rely on an array of sophisticated heuristics developed with
decades of research to solve large-scale MIP instances encountered in practice. Machine learning offers to
automatically construct better heuristics from data by exploiting shared structure among instances in the
data. This paper applies learning to the two key sub-tasks of a MIP solver, generating a high-quality joint
variable assignment, and bounding the gap in objective value between that assignment and an optimal one.
Our approach constructs two corresponding neural network-based components, Neural Diving and Neural
Branching, to use in a base MIP solver such as SCIP. Neural Diving learns a deep neural network to generate
multiple partial assignments for its integer variables, and the resulting smaller MIPs for un-assigned variables
are solved with SCIP to construct high quality joint assignments. Neural Branching learns a deep neural
network to make variable selection decisions in branch-and-bound to bound the objective value gap with
a small tree. This is done by imitating a new variant of Full Strong Branching we propose that scales to
large instances using GPUs. We evaluate our approach on diverse real-world datasets, including two Google
production datasets and MIPLIB, by training separate neural networks on each. Most instances in all the
datasets combined have 103 − 106 variables and constraints after presolve, which is significantly larger than
previous learning approaches. Comparing solvers with respect to primal-dual gap averaged over a held-out
set of instances at large time limits, the learning-augmented SCIP achieves 1.5×, 2×, and 104× better gap
on three out of the five datasets with the largest MIPs, achieves a 10% gap 5× faster on a fourth one, and
matches SCIP on the fifth. To the best of our knowledge, ours is the first learning approach to demonstrate
such large improvements over SCIP on both large-scale real-world application datasets and MIPLIB.
Key words: deep learning, mixed integer programming, discrete optimization, graph networks
History : First version December 2020.
1. Introduction
Mixed Integer Programs (MIPs) are a class of NP-hard problems where the goal is to minimize a
linear objective subject to linear constraints, with some or all of the variables constrained to be
integer-valued (Wolsey 1998, Karp 1972). They have enjoyed widespread adoption in a broad range
Equal contributors, listed alphabetically by last name after the lead author.
Corresponding author, email: [email protected].
1
arXiv:2012.13349v3 [math.OC] 29 Jul 2021

Page 2
2
Generative
Modelling
Distribution of MIPs
Training
MIPs
Test
MIPs
Imitation
Lernen
Neural MIP Solver
Partial assign. 1
Partial assign. K
Solve
sub-MIP 1
...
Solve
sub-MIP K
High-quality
assignment
Optimality
gap
Tree search
over integer
assignments
Neural Branching
Neural Diving
xi = 0
xi = 1
xj = 1
xj = 0
Assignment
Predictor
Branching
policy
MIP to
solve
Training
Figure 1
Our approach constructs two neural network-based components to use in a MIP solver, Neural
Diving and Neural Branching, and combine them to produce a Neural Solver customized to a
given MIP dataset.
of applications such as capacity planning, resource allocation, bin packing, etc. (Taha 2014, Jünger
et al. 2009, Sierksma and Zwols 2015). Significant research and engineering effort has gone into
developing practical solvers, such as SCIP (Gamrath et al. 2020), CPLEX (IBM ILOG CPLEX
2019), Gurobi (Gurobi Optimization 2020), and Xpress (FICO Xpress 2020). These solvers use
sophisticated heuristics to direct the search process for solving a MIP. A solver’s performance on
a given application depends crucially on how well its heuristics suit that application.
In this paper we show that machine learning can be used to automatically construct effective
heuristics from a dataset of MIP instances. A compelling use case for this arises often in practice
where an application requires solving a large set of instances of the same high-level semantic prob-
lem with different problem parameters. Examples of such “homogenous” datasets in this paper are
1) optimizing the choice of power plants on an electric grid to meet demand (O’Neill 2017), where
grid topology remains the same while demand, renewable generation, etc. vary across instances,
and 2) solving a packing problem in Google’s production system where the semantics of “items”
and “bins” to be packed remain mostly the same but their sizes fluctuate across instances. Even
a “heterogeneous” dataset that combines many semantically different problems, such as MIPLIB
2017 (Gleixner et al. 2019), can have structure across instances that can be used to learn better
heuristics, as we shall show. Off-the-shelf MIP solvers cannot automatically construct heuristics
to exploit such structure. In challenging applications users may rely on an expert to hand-design
such heuristics, or forego potentially large performance improvements. Machine learning offers the
possibility of large improvements without needing application-specific expertise.
We demonstrate that machine learning can construct heuristics customized to a given dataset
that significantly outperform the classical ones used in a MIP solver, specifically the state-of-the-
art non-commercial solver SCIP 7.0.1 (Gamrath et al. 2020). Our approach applies learning to the

Page 3
3
two key sub-tasks of a MIP solver: a) output an assignment of values to all variables that satisfy
the constraints (if such an assignment exists), and b) prove a bound for the gap in objective value
between that assignment and an optimal one. They define the main components of our approach,
Neural Diving and Neural Branching (see figure 1).
Neural Diving: This component finds high quality joint variable assignments. It is an instance
of a primal heuristic (Berthold 2006), a class of search heuristics that have been identified as key to
effective MIP solvers (Berthold 2013). We train a deep neural network to produce multiple partial
assignments of the integer variables of the input MIP. The remaining unassigned variables define
smaller ‘sub-MIPs’, which are solved using an off-the-shelf MIP solver (e.g., SCIP) to complete the
assignments. The sub-MIPs can be solved in parallel if the compute budget allows. The model is
trained to give higher probability to feasible assignments that have better objective values, using
training examples collected offline with an off-the-shelf solver. It learns on all available feasible
assignments instead of only the optimal ones, and does not necessarily require optimal assignments,
which can be expensive to collect.
Neural Branching: This component is mainly used to bound the gap between the objective
value of the best assignment and an optimal one. MIP solvers use a form of tree search over
integer variables called branch-and-bound (Land and Doig 1960) (see section 2), which progressively
tightens the bound and helps find feasible assignments. The choice of the variable to branch on at
a given node is a key factor in determining search efficiency (Achterberg et al. 2005, Glankwamdee
and Linderoth 2011, Schubert 2017, Yang et al. 2019). We train a deep neural network policy to
imitate choices made by an expert policy. The imitation target is a well-known heuristic called
Full Strong Branching (FSB), which has been empirically shown to produce small search trees
(Achterberg et al. 2005). While it is often too computationally expensive for practical MIP solving,
it can still be used to generate imitation learning data offline as a slow and expensive one-time
computation. Once trained, the neural network is able to approximate the expert at test time at a
fraction of the computational cost. A CPU-based implementation of FSB can be too expensive on
large-scale MIPs even for offline data generation. We develop a variant of FSB using the alternating
directions method of multipliers (ADMM) (Boyd et al. 2011) that scales to large-scale MIPs by
performing the required computation in a batch manner on GPU.
We evaluate our approach on a diverse set of datasets containing large-scale MIPs from real-
world applications, including two from Google’s production systems, as well as MIPLIB (Gleixner
et al. 2019), which is a heterogeneous dataset and a standard benchmark. Most of the combined
set of MIPs from all datasets have 103–106 variables and constraints after presolve (see figure 4 in
section 4), which is significantly larger than earlier works (Gasse et al. 2019, Ding et al. 2020). Once
Neural Diving and Neural Branching models are trained on a given dataset, they are integrated

Page 4
4
> 5x faster
> 1.5x better gap
> 104x better gap
> 2x better gap
Figure 2
Main result of the paper: our approach, Neural Branching + Neural Diving, matches or outper-
forms SCIP with respect to the primal-dual gap (Berthold 2006), averaged on held-out instances,
as a function of running time (referred to as calibrated time, see section 5) for the five datasets
in our evaluation with the largest MIPs. Note the log scale on both axes.
into SCIP to form a “Neural Solver” specifically for that dataset. The baseline is SCIP with its
emphasis parameters tuned by grid search on each dataset separately, which we refer to as Tuned
SCIP. Comparing the Neural Solver and Tuned SCIP with respect to the primal-dual gap (Berthold
2006) averaged over a held-out set of instances in figure 2, the Neural Solver provides either
significantly better gap in the same running time, or the same gap in significantly less time, on
four out of the five datasets in our evaluation with the largest MIPs, while matching Tuned SCIP’s
performance on the fifth. To the best of our knowledge, this is the first work to demonstrate such
large improvements over SCIP on both large-scale real-world application datasets and MIPLIB
using machine learning.
Tuned SCIP is the baseline we compare to since we use SCIP as the base solver for integrating
learned heuristics. As a base solver SCIP provides a) extensive access to its internal state for inte-
grating learned models, and b) permissive licensing that enables large-scale evaluation by running a
large number of solver instances in parallel. We do not have access to commercial solvers with these
features, which makes a fair comparison to them infeasible. We have done a partial comparison of
Gurobi versus Neural Diving alone, with Gurobi as its sub-MIP solver, on two datasets (see section
6). Comparing the primal gap (Berthold 2006) averaged over a held-out set of instances, Neural
Diving with parallel sub-MIP solving reaches 1% average primal gap in 3× and 3.6× less time than

Page 5
5
Gurobi on the two datasets. We have also applied a modified version of Neural Diving to a subset of
‘open’ instances in MIPLIB to find three new best known assignments, beating commercial solvers.
Several earlier works have focused on learning primal heuristics (Khalil et al. 2017b, Ding et al.
2020, Hendel 2018, Hottung and Tierney 2019, Xavier et al. 2020, Song et al. 2020, Addanki
et al. 2020). Unlike them, Neural Diving poses the problem of predicting variable assignments as
a generative modelling problem, which provides a principled way to learn on all available feasible
assignments and also to generate partial assignments at test time. Several works have also looked
at learning a branching policy (He et al. 2014, Khalil et al. 2016, Alvarez et al. 2017, Balcan et al.
2018, Gasse et al. 2019, Yang et al. 2020, Zarpellon et al. 2020, Gupta et al. 2020). Many of these
focus specifically on learning to imitate FSB as we do (Khalil et al. 2016, Alvarez et al. 2017, Gasse
et al. 2019, Gupta et al. 2020). Unlike them, Neural Branching uses a more scalable approach to
computing the target policy using GPUs, which allows it to generate more imitation data from
larger instances in the same time limit than a CPU-based FSB implementation. We also go beyond
earlier works that study learning individual heuristics in isolation by combining a learned primal
heuristic and a learned branching policy in a solver to achieve significantly better performance on
large-scale real-world application datasets and MIPLIB.
1.1. Contributions
1. We propose Neural Diving (section 6), a new learning-based approach to generating high-
quality joint variable assignments for a MIP. On homogeneous datasets, Neural Diving achieves an
average primal gap of 1% on held out instances 3-10× faster than Tuned SCIP. On one dataset
Tuned SCIP does not reach 1% average primal gap within the time limit, while Neural Diving
does.
2. We propose Neural Branching which learns a branching policy to use in the branch-and-
bound algorithm by imitating a new scalable expert policy based on ADMM (section 7). On two
of the datasets used in our evaluation for which FSB is slow due to instance sizes (e.g., with > 105
variables) or high per-iteration times, the ADMM expert generates 1.4× and 12× more training
data in the same running time. The learned policy significantly outperforms SCIP’s branching
heuristic on four datasets with 2-20× better average dual gap (Berthold 2006) on held out instances
at large time limits, and has comparable performance on the other rest.
3. We combine Neural Diving and Neural Branching (section 8) to attain significantly better
performance than SCIP with respect to the average primal-dual gap on four out of the five datasets
with the largest MIPs, while matching its performance on the fifth one.
In addition, we have open-sourced a dataset for the application of Neural Network Verification
(sections 4, 12.6), which we hope will help further research on new learning techniques for MIPs.

Page 6
6
2. Integer programming background
In this section we present some basic integer programming concepts and techniques relevant for
the paper. A mixed integer linear program has the form
minimize c x
subject to Ax ≤ b
l ≤ x ≤ u
xi ∈ Z, i ∈ I
(1)
over variables x ∈ Rn, where A ∈ Rm×n, b ∈ Rm, c ∈ Rn , l ∈ (R∪ {−∞})n, and u ∈ (R∪ {∞})n are
given data, and I⊆{1,...,n} refers to the index set of integer variables. We allow l and u to take
infinite values to indicate that the associated variable has no lower or upper bound respectively,
for those indices. A (complete) assignment is any point x ∈ Rn. A partial assignment is when we
have fixed some, but not all, of the variable values. A feasible assignment is an assignment that
satisfies all the constraints in problem (1), and an optimal assignment, or a solution, is a feasible
assignment that also minimizes the objective (Boyd and Vandenberghe 2004, §1.1).
2.1. Linear programming relaxation
If we remove the integer constraints in problem (1) then it becomes a linear program (LP), which
is convex and can be solved efficiently (Boyd and Vandenberghe 2004). The optimal value of
the relaxed problem is guaranteed to be a lower bound for the original problem, since removing
constraints can only expand the feasible set. If the optimal solution to the LP satisfies the integral
constraints then it is guaranteed to be optimal for the original problem. We shall refer to any lower
bound (in our case found via linear programming) as a dual bound.
2.2. Branch-and-bound
A common procedure to solve MIPs is to recursively build a search tree, with partial integer
assignments at each node, and use the information gathered at each node to eventually converge
on an optimal (or close to optimal) assignment (Land and Doig 1960, Lawler and Wood 1966).
At every step we must choose a leaf node from which to ‘branch’. At this node we can solve the
LP relaxation, where we constrain the ranges of the fixed variables at that node to their assigned
values. This gives us a valid lower bound on the true objective value of any further child nodes
from this node. If this bound is larger than a known feasible assignment, then we can safely prune
this part of the search tree since no optima for the original problem can exist in the subtree from
that node. If we decide to expand this node, then we must choose a variable to branch on from
the set of unfixed variables at that node. Once a variable is selected, we take a branching step,

Page 7
7
which adds two child nodes to the current node. One node has the domain of the selected variable
constrained to be greater or equal to the ceiling of its LP relaxation value at the parent node. The
other node has the selected variable’s domain constrained to be less than or equal to the floor of
its LP relaxation value. The tree is updated, and the procedure begins again. This algorithm is
referred to as branch-and-bound. Linear programming is a main workhorse of this procedure, both
to derive the dual bounds at every node and to decide the branching variable for some of the more
sophisticated branching heuristics. In theory the size of the search tree can be exponential in the
input size of the problem, but in many cases the search trees can be small and it is an area of
established and active research to come up with node selection and variable selection heuristics
that keep the tree as small as possible.
2.3. Primal heuristics
A primal heuristic is a method that attempts to find a feasible, but not necessarily optimal, variable
assignment (Berthold 2006). Any such feasible assignment provides a guaranteed upper bound on
the optimal value of the MIP. Any such bound found at any point during a MIP solve is called a
primal bound. Primal heuristics can be run independently of branch-and-bound, but they can also
be run within a branch-and-bound tree and attempt to find a feasible assignment of the unfixed
variables from a given node in the search tree. Better primal heuristics that produce lower primal
bounds allow the branch-and-bound procedure to prune more of the tree. Simple rounding, where
the fractional variables are rounded to integer values (possibly randomly) is an example of a primal
heuristic. Another case is diving which attempts to find a feasible solution by exploring the search
tree from a given node in a depth-first manner (Berthold 2006, Eckstein and Nediak 2007).
2.4. Primal-dual gap
When running branch-and-bound we keep track of the global primal bound (the minimum objective
value of any feasible assignment) and the global dual bound (the minimum dual bound across all
leaves of the branch-and-bound tree). We can combine these to define a sub-optimality gap
gap = global primal bound−global dual bound.
The gap is always nonnegative by construction, and if it is zero then we have solved the problem, the
feasible point that corresponds to the primal bound is optimal and the dual bound is a certificate
of optimality. In practice we terminate branch-and-bound when the relative gap (i.e., normalized
in some way, see §5) is below some application-dependent quantity, and produce the best found
primal solution as the approximately optimal solution.

Page 8
8
Variable
nodes
Constraint
nodes
Figure 3
Bipartite graph representation of a MIP used as the input to a neural network. The set of n variables
{x1,...,xn} and the set of m constraints {δ1,...,δm} form the two sets of nodes of the bipartite graph.
The coefficients are encoded as features of the nodes and edges.
3. MIP Representation and Neural Network Architecture
We describe how a MIP is represented as an input to a neural network, and the architecture we use
to learn models for both Neural Diving and Neural Branching. The key deep learning architecture
we use is a form of graph neural network (Scarselli et al. (2009), see survey by Battaglia et al.
(2018)), specifically a graph convolutional network (GCN) (Kipf and Welling 2016).
3.1. Representing a MIP as an Input to a Neural Network
We use a bipartite graph representation of a MIP, as done in Gasse et al. (2019). Equation (1)
can be used to define a bipartite graph where one set of n nodes in the graph correspond to the
n variables being optimized, and the other set of m nodes correspond to the m constraints; see
figure 3. The edge is present between a variable node and a constraint node if the corresponding
variable appears in that constraint, and so the number of edges correspond to the number of
non-zeros in the constraint matrix. The objective coefficients {c1,...,cn}, the constraint bounds
{b1,...,bm}, and the non-zero coefficients of the constraint matrix are encoded as scalar “features”
that annotate the corresponding variable nodes, constraint nodes, and edges, respectively. The
variable type (continuous or integer) can also be encoded as a (categorical) feature of the variable
nodes. This defines a lossless representation of the MIP that can be used as an input to a graph
neural network. A similar representation is used for MIPs by Ding et al. (2020), and for Boolean
Satisfiability instances in an earlier work by Selsam et al. (2019). Both nodes and edges can be
annotated by multi-dimensional feature vectors that encode additional information about the MIP
that can be useful for learning (e.g., the solution of the LP relaxation as additional variable node

Page 9
9
features). We use the code provided by Gasse et al. (2019) to compute the same set of features
using SCIP.1
3.2. Neural Network Architecture
We describe here the common aspects of the network architecture used by both Neural Diving
and Neural Branching. Those aspects that differ between the two, e.g., the outputs and the loss
functions, are given in their corresponding sections 6 and 7.
Given the bipartite graph representation of a MIP, we use a GCN to learn models for both Neural
Diving and Neural Branching. Let the input to the GCN be a graph G = (V,E,A) defined by the
set of nodes V, the set of edges E, and the graph adjacency matrix A. In the case of MIP bipartite
graphs, V is the union of n variable nodes and m constraint nodes, of size N := |V| = n + m. A
is an N × N binary matrix with Aij = 1 if nodes indexed by i and j are connected by an edge, 0
otherwise, and Aii = 1 for all i. Each node has a D-dimensional feature vector, denoted by ui ∈ RD
for the ith node. Let U ∈ RN×D be the matrix containing feature vectors of all nodes as rows,
i.e., the ith row is ui. A single-layer GCN learns to compute an H-dimensional continuous vector
representation for each node of the input graph, referred to as a node embedding. Let zi ∈ RH be the
node embedding computed by the GCN for the ith node, and Z ∈ RN×H be the matrix containing
all node embeddings as rows. We define the function computing Z as follows:
Z = Afθ(U),
(2)
where fθ : RD → RH is a Multi-Layer Perceptron (MLP) (Goodfellow et al. 2016) with learnable
parameters θ ∈ Θ. (Here we have generalized fθ from a linear mapping followed by a fixed nonlin-
earity in the standard GCN by Kipf and Welling (2016) to an MLP.) We overload the notation
to allow fθ to operate on N nodes simultaneously, i.e., fθ(U) denotes applying the MLP to each
row of U to compute the corresponding row of its output matrix of size N ×H. Multiplying by A
combines the MLP outputs of the ith node’s neighbors to compute its node embedding. The above
definition can be generalized to L layers as follows:
Z(0) = U
(3)
Z(l+1) = Afθ(l)(Z(l)), l = 0,...,L−1,
(4)
where Z(l) and fθ(l)() denote the node embeddings and the MLP, respectively, for the lth layer. The
Lth layer’s node embeddings can be used as input to another MLP that compute the outputs for
1 For the list of features, see https://papers.nips.cc/paper/2019/file/d14c2267d848abeb81fd590f371d39bd-Supplemental.
zip

Page 10
10
the final prediction task, as shown in sections 6 and 7. For each variable xd we further denote the
corresponding node embedding from the last layer as vd.
Two key properties of the bipartite graph representation of the MIP and the GCN architecture
are: 1) the network output is invariant to permutations of variables and constraints, and 2) the
network can be applied to MIPs of different sizes using the same set of parameters. Both of these
are important because there may not be any canonical ordering for variables and constraints,
and different instances within the same application can have different number of variables and
constraints.
3.3. Improvements to the Architecture
We describe changes to the above architecture, as well as from previous work, which gave perfor-
mance improvements.
1. We modify the MIP bipartite graph’s adjacency matrix A to contain coefficients from the
MIP’s constraint matrix A, instead of binary values indicating the presence of edges. Specifically,
for the ith variable and jth constraint, their two corresponding entries in A is set to aji where aji
is the coefficient at row j and column i of A. This results in edges weighted by the entries of A.
2. We extend the node embeddings for layer l + 1 by concatenating the node embeddings from
layer l. Specifically, we now define the embedding for layer l +1 to be CZ(l+1) = (Z(l+1), CZ(l)), i.e.,
the concatenation of the matrices row-wise, with CZ(0) = Z0. This is a form of skip connection (He
et al. 2016) and also similar to the jumping knowledge networks architecture (Xu et al. 2018).
3. We apply layer norm (Lei Ba et al. 2016) at the output of each layer, so that Z(l+1) =
LayerNorm(Afθ(l)(Z(l))).
We have explored alternative architectures which use embeddings for both nodes and edges with
separate MLPs to compute them. When using such networks with a high-dimensional edge embed-
ding at every layer, their memory usage can become much higher than that of GCNs, which only
need the adjacency matrix, and may not fit the GPU memory unless the number of layers is reduced
at the cost of accuracy. GCNs are a better fit for our goal of scaling to large-scale MIPs.
4. Datasets
We summarize the details of our datasets in table 1. All datasets except MIPLIB are application-
specific, i.e., they contain instances from only a single application. We include MIPLIB in our
evaluation even though it contains instances from many applications. This is not the setting in
which we expect learning will provide the most benefit. Nevertheless, it is a well-established bench-
mark for evaluating solvers, and the results on it can provide insight on the performance of learning

Page 11
11
Table 1
MIP datasets used for evaluation. See additional details in section 12.6
.
Name
Description
CORLAT
A small-scale dataset related to wildlife management (Conrad et al.
2007, Gomes et al. 2008). We include it as it is public, but our main
focus is on the larger-scale ones.
NN Verification
Verifying whether a neural network is robust to input perturbations can
be posed as a MIP (Cheng et al. 2017, Tjeng et al. 2019). Each input on
which to verify the network gives rise to a different MIP. In this dataset,
a convolutional neural network is verified on each image in the MNIST
dataset, giving rise to a corresponding dataset of MIPs.
Google Production
Packing
A packing optimization problem solved in a Google production system.
Google Production
Planning
A planning optimization problem solved in a Google production system.
Electric Grid Opti-
mization
Electric grid operators optimize the choice of power generators to use
at different time intervals during a day to meet electricity demand by
solving a MIP. This dataset is constructed for one of the largest grid
operators in the US, PJM, using publicly available data about generators
and demand, and the MIP formulation in (Knueven et al. 2018).
MIPLIB
Heterogeneous dataset containing ‘hard’ instances of MIPs across many
different application areas that is used as a long-standing standard
benchmark for MIP solvers (Gleixner et al. 2019). We use instances from
both the 2010 and 2017 versions of MIPLIB.
101
102
103
104
105
106
107
Number of variables
101
102
103
104
105
106
107
Number of constraints
Google Production Planning
CORLAT
Neural Net Verification
Google Production Packing
Electric Grid Optimization
101
102
103
104
105
106
107
Number of variables
101
102
103
104
105
106
107
Number of constraints
MIPLIB
Figure 4
Number of variables and constraints after presolve using SCIP 7.0.1 for the datasets used in our
evaluation. MIPLIB is shown separately (right) to reduce clutter.

Page 12
12
approaches in an unfavorable setting. The MIP sizes for the various datasets after presolving are
shown in figure 4.
For all datasets except MIPLIB, we define the training, validation, and test sets by randomly
splitting the instances into disjoint subsets with 70%, 15%, and 15% of the instances, respectively.
For MIPLIB, we use instances from the MIPLIB 2017 Benchmark Set as the test set, since that
is the set on which solvers are evaluated. The MIPLIB 2017 Collection Set and the MIPLIB
2010 set are used as the training and validation sets, respectively, after removing any overlapping
instances with the MIPLIB 2017 Benchmark Set. For each dataset, the training set is used to learn
models for that dataset, the validation set is used to tune the learning hyperparameters and SCIP’s
metaparameters, and the test set is used to report evaluation results. Additional details are given
in the Appendix, see section 12.6.
5. Evaluation
We shall evaluate Neural Diving and Neural Branching individually first, and then jointly in the
sequel. In all cases we evaluate on a test set of MIPs disjoint from the training set to measure
generalization to unseen instances. We present results using gap plots and survival plots:
Gap plots: Neural Diving is evaluated with respect to the primal gap γp(t) (Berthold 2013)
as a function of solving time t. If no feasible assignment is known then the gap is defined to be
1, otherwise denote by p(t) the primal bound at time t and denote by pthe best known primal
bound (possibly precomputed), the gap is defined as
γp(t) =
(
1
p(t)·p< 0
p(t)−p
max {|p(t)|,|p|,ϵ}
otherwise.
(5)
We use ϵ = 10−12 to avoid division by 0. To evaluate Neural Branching we use a dual gap, which is
defined analogously
γd(t) =
(
1
d(t)·p< 0
p−d(t)
max {|d(t)|,|p|,ϵ}
otherwise,
(6)
where d(t) is the global dual bound at time t. When evaluating Neural Diving and Neural Branching
jointly, we use the primal-dual gap
γpd(t) =
(
1
d(t)·p(t) < 0
p(t)−d(t)
max {|d(t)|,|p(t)|,ϵ}
otherwise.
(7)
In each case, we plot the average gap for test MIPs as a function of running time. We pre-compute
pfor a MIP by running SCIP on it with default parameters and a time limit of 24 hours. Any
suboptimality in pwill affect all solvers being evaluated and relative comparisons using the gaps
are therefore still valid. If during the solve we find a better primal bound then we replace pwith
the new value and recompute gaps at all previous times.

Page 13
13
Survival plots: A survival plot shows the fraction of test set MIPs solved (to the target primal-
dual gap) as a function of running time. Applications may specify a target optimality gap higher
than 0 as a termination condition. In such a case improving the gap below the target does not
improve the solve performance. The survival plot reflects the effect of the target gap, while the gap
plots do not.
Calibrated time: The total evaluation workload across all datasets and comparisons requires
more than 160,000 MIP solves and nearly a million CPU and GPU hours. To meet the compute
requirements, we use a shared, heterogeneous compute cluster. Accurate running time measure-
ment on such a cluster is difficult because the tasks may be scheduled on machines with different
hardware, and interference from other unrelated tasks on the same machine increases the variance
of solve times. To improve accuracy, for each solve task, we periodically solve a small calibration
MIP on a different thread from the solve task on the same machine. We use an estimate of the
number of calibration MIP solves during the solve task on the same machine to measure time,
which is significantly less sensitive to hardware heterogeneity and interference. This quantity is then
converted into a time value using the calibration MIP’s solve time on a reference machine. Section
12.7 gives the details. Results for four instances from MIPLIB show a 1.5× to 30× reduction in
the coefficient of variation of time measurements compared to measuring wall clock time.
Tuned SCIP: The main baseline we compare against is SCIP 7.0.1 with its parameters tuned
for each test dataset. SCIP has emphasis “meta-parameters” for presolving, primal heuristics, and
cuts, each of which has four possible settings (default, off, aggressive, fast). For each dataset, we
use exhaustive search over the 43 = 64 combinations to find the setting that produces the best
average primal-dual gap on a subset of 200 validation MIPs in a 3 hour time limit. We call this
baseline Tuned SCIP.
Performance variability with respect to random seed: To account for the performance
variability of MIP solvers (Lodi and Tramontani 2014) with respect to changes to a MIP that
leave the problem unchanged, such as permuting rows and columns of the constraint matrix, we
vary the random seed parameter used by SCIP. Specifically, we set the SCIP parameter ran-
domization/permutevars to True, and assign the parameters randomization/permutationseed and
randomization/randomseedshift both to a give seed value. The seed is set to the values {1,2,3,4,5}
for each instance. The evaluation results reported for Neural Diving, Neural Branching, and their
combination are computed by aggregating over all the instance-seed pairs produced using the test
set instances.

Page 14
14
6. Neural Diving
In this section we describe our approach to learning a diving-style primal heuristic that produces
high quality assignments to MIPs from a given instance distribution. The idea is to train a gener-
ative model over assignments to a MIP’s integer variables from which partial assignments can be
sampled. We use SCIP to obtain high-quality assignments (not necessarily optimal) as the target
labels for the training set of MIPs. Once trained on this data, the model predicts values for integer
variables on unseen instances from the same problem distribution. The uncertainty represented in
the model predictions is used to define partial assignments to the original MIP that fix a large
fraction of the integer variables. These substantially smaller sub-MIPs can then be solved quickly
using SCIP, yielding high quality feasible assignments.
Diving refers to a set of primal heuristics that explore the branch-and-bound tree in a depth-first
manner by sequentially fixing integer variables until a leaf node is reached or the assignment is
deemed infeasible (Berthold 2006, Eckstein and Nediak 2007). There are a few major differences
between regular diving and what we describe here. First, diving can be started from any node, but
in this work we focus on diving only from the root node, though in principle it could be performed
from other nodes. Secondly, diving typically descends all the way to a leaf node, but in this case we
descend only partially and then use a MIP solver to solve the remaining sub-MIP. For this reason
one could reasonably call our method a hybrid of diving and neighborhood search (Mladenovic and
Hansen 1997, Shaw 1998, Hansen et al. 2010), but for brevity we refer to it simply as Neural Diving.
Finally, diving usually proceeds in an iterative manner where decisions are made sequentially and
the linear program is re-solved after each decision. The vanilla version of Neural Diving produces
assignments that define the sub-MIP entirely in parallel, and only re-solves the linear program once.
In section 12.1 we describe an extension where the decisions are made sequentially. Our approach
is similar to Relaxation Enforced Neighborhood Search (RENS) (Berthold 2007) in that we fix
a subset of variables and solve the resulting sub-MIP, but in our case the variables are assigned
values predicted by a learned model rather than based on the linear program solution, and we use
multiple partial assignments instead of only one.
6.1. Solution Prediction as Conditional Generative Modelling
Consider an integer program (i.e., all variables are integers) with parameters M = (A, b, c) (see
equation 1) and a nonempty feasible set over a set of integer variables x. Assuming minimization,
we define an energy function over x using the objective function:
E(x;M) =
(
cT x if x is feasible,
otherwise,
(8)

Page 15
15
which then defines the conditional distribution
p(x|M) =
exp(−E(x;M))
Z(M)
(9)
where Z(M) is the partition function that normalizes the distribution to sum to 1:
Z(M) =
x
exp(−E(x ;M)).
(10)
Feasible assignments with better (i.e., lower) objective values have higher probability. Infeasible
assignments have zero probability. Note that multiplying c by a constant β changes the distribution,
even though the feasible set and the optimal assignment(s) remain the same. (β can be interpreted
as the inverse temperature parameter of the distribution.) The distribution can be made invariant
to such a re-scaling of the objective by normalizing the energy as E(x;M)/|E(x;M)| where xis
an optimal assignment. We have not used such normalization in this work.
When a MIP contains both integer variables xI and continuous variables xC, a given assignment
for xI defines a linear program (LP) on xC. The energy function E(x;M) can then be computed by
assigning xC to the optimal LP solution x
C (if feasible) and then setting E(x;M) to the objective
value of the resulting complete assignment.
6.1.1. Learning The learning task is to approximate p(x|M) using a generative model
pθ(x|M) parameterized by θ. The training dataset is Dtrain = {(Xi,Mi)}N
i=1, where {Mi ∼ p(M)}N
i=1
are N independent and identically distributed (IID) samples from the application-specific MIP
distribution p(M) and Xi = {xi,j}Ni
j=1 is a set of unique Ni assignments for the instance Mi. Xi
is obtained by running SCIP on Mi and collecting feasible assignments it finds during the solve.
While potentially costly, this data collection step needs to be done only once (per application), and
outside the training loop. We learn on all assignments (explained below), not just the best ones,
and do not require that any of the assignments be optimal.
The model parameters θ are learned by minimizing the following weighted loss function with
respect to θ:
L(θ) = −
N
i=1
Ni
j=1
wij logpθ(xi,j|Mi),
(11)
where the weights wij are used to reduce any bias in sampling the training examples Xi = {xi,j}Ni
j=1
for the instance Mi. Consider the case where Xi contains all possible feasible assignments of the
integer variables for Mi without any duplicates (i.e., xi,j = xi,k, ∀j = k). Then we can use the
weights
wij =
exp(−cT
i xi,j)
Ni
k=1 exp(−cT
i xi,k)
.
(12)

Page 16
16
Variable
nodes
Constraint
nodes
Graph
Convolutional
Network
Variable
nodes
Constraint
nodes
Multi-Layer
Perceptron
MIP Bipartite Graph
Inferred Node Embeddings
Bernoulli Distributions
for Binary Variables
Figure 5
Conditionally independent model (Section 6.1.2) infers node embeddings {v1,...,vn} for the variable
nodes of the bipartite graph representation of an input MIP M using a graph convolutional network
and applies a multi-layer perceptron to the node embeddings {vd}d∈I of the set of binary variables I in
M to compute their Bernoulli distributions. (See Section 6.1.3 for the case of general integers.) These
distributions are independent of each other conditioned on the input MIP.
in equation 11 to learn the target distribution p(x|Mi) using Xi. Enumerating all possible assign-
ments to construct Xi is not practical for MIPs of realistic size. Instead we apply an off-the-shelf
solver, SCIP in our case, to Mi. We define Xi to include all feasible assignments found throughout
the solve, with duplicates removed. Using the weights given by equation 12 no longer corresponds
to learning the exact target distribution p(x|Mi) as the partition function is approximate. Although
approximate, learning still succeeds in producing models with strong empirical performance on the
datasets used in our evaluation. Intuitively we expect the approximation to produce good results
because it assigns higher weights to better assignments.
6.1.2. Conditionally-independent Model There are several choices for parameterizing the
generative model (see, e.g., Bishop (2006), Bengio and Bengio (2000), Kingma and Welling (2014)).
Let I be the set of dimensions of x corresponding to the integer variables. Let xd denote the dth
dimension of x. We use a model of the form:
pθ(x|M) =
d∈I
pθ(xd|M),
(13)
which predicts a distribution for xd that is independent of the other dimensions conditioned on M.
For simplicity, we first assume that each xd is binary with possible values {0,1} (the general
integer case is considered in section 6.1.3). We use the Bernoulli distribution for each such variable.
The success probability µd for the Bernoulli distribution pθ(xd|M) is computed as
td = MLP(vd;θ),
(14)
µd = pθ(xd = 1|M) =
1
1 + exp(−td)
,
(15)

Page 17
17
where vd is the embedding computed by a graph convolutional network for the MIP bipartite graph
node corresponding to xd, as described in section 3.2. Note that the same MLP is used for all
variables, see figure 5.
While the conditionally-independent model cannot accurately model a multimodal distribution
over assignments, empirically, it still shows strong performance in our experiments. We also inves-
tigated more sophisticated models, such as autoregressive models (Bengio and Bengio (2000), see
Section 12.1), which provide modest improvements but at a much higher inference cost.
6.1.3. Handling General Integers The model can be generalized to non-binary integer
variables with some modifications. The main challenges are: a) the cardinality of an integer variable
can vary significantly across instances, and b) it can potentially be very large (e.g., 107). This makes
simple approaches such as assuming a fixed maximum cardinality highly inefficient and difficult to
learn with.
We address these challenges by reframing the prediction task for general integer variables as a
sequence of binary prediction tasks, based on the binary representation of the target integer value.
For an integer variable z that can be assigned values from a finite set with cardinality card(z),
any target value can be represented as a sequence of ⌈log2(card(z))⌉ bits. We train our model to
predict these bits in sequence, from most significant to least significant bit. Since the maximum
cardinality of variables in a test instance becomes known only during inference, and is unknown
during training, we introduce a hyperparameter nb that controls the maximum number of bits
predicted for each variable along this bit sequence. We can then train our model to predict the nb
most significant bits of the value for z, given the upper and lower bounds for z. During inference
then, if ⌈log2(card(z))⌉ exceeds nb, we can still predict the nb most significant bits of the value for
z, and use these predictions to tighten the bounds of z by a factor of at least 2nb−1. Otherwise,
if ⌈log2(card(z))⌉ ≤ nb, then we can predict z exactly to a single value. Note that this approach
of bit-wise prediction of values can also be seen as predicting branching decisions on variables:
Predicting one bit means picking the left or right branch of a binary tree, and the possible range
of the integer variable is successively tightened after each prediction.
6.2. Combining Model Predictions with a Classical Solver
Once a model is learned, we can then sample variable assignments x ∼ pθ(x|M) for a new problem
M. While we could try to use these samples directly, they need not be feasible or provide the best
objective value. Instead, it is more effective to only fix a (large) subset of the variables to their
sampled values, and delegate the solution search for the remaining open variables to a classical
solver, in our case, SCIP.

Page 18
18
We use the SelectiveNet approach (Geifman and El-Yaniv 2019) to train an additional binary
classifier that decides which variables to predict a value for and which to refrain from predicting,
and optimizes for “coverage” among variables, defined as the ratio of the number of variables
predicted vs not predicted. Specifically, we introduce an additional output yd ∈ {0,1} for each
variable xd in the input M that determines whether xd should be assigned (yd = 1) or not. For the
conditionally-independent model, we can then train our model by minimizing the following loss
function:
lselective(θ,x,M) =
−∑d∈I logpθ(xd|M)·yd
d∈I yd
+λΨ(C −
1
|I|
d∈I
yd),
(16)
Lselective(θ) =
i,j
wij ·lselective(θ,xi,j,Mi).
(17)
Here, C is a coverage threshold representing the desired relative frequency of assigned variables,
Ψ is a quadratic penalty term, and λ is a hyperparameter controlling the relative importance of
achieving a coverage close to the set threshold. We train multiple models simultaneously with
different coverage thresholds (typically, values from 0.1 to 0.95, and are tuned on the validation
set).
By assigning, or tightening the bounds of a large fraction of the variables we significantly reduce
the problem size, and warm-start SCIP to find high quality solutions in much shorter time. The
idea here is that by sampling a solution from pθ(x|M), we move to a promising region of the
solution space; and by removing again some of the assignments we expand the neighbourhood in
which we then search for the optimal solution.
This approach also offers practical computational advantages: both the sampling of predictions,
and the solution search afterwards are fully parallelizable. We can repeatedly and independently
extract many different samples from our model, and each partial assignment of the sample can be
independently solved. For a single MIP instance, we can generate many such partial assignments,
each of which we can independently solve with SCIP. The feasible solution with the best objective
value across all sub-MIPs is then reported as the final solution to the original input MIP. Our
approach can hence easily leverage the power of distributed computing even when solving for a
single instance, which is not possible with default SCIP (without significant effort (Shinano et al.
2011)). When comparing results, we will show results for both the parallel setting (where we make
full use of the parallelism advantage of our approach) and the sequential setting (where we run our
approach in a sequential manner, controlling for the total amount of computational resources).

Page 19
19
102
103
Calibrated time (seconds)
0.0
0.1
0.2
0.3
0.4
0.5
Average primal gap
CORLAT
Tuned SCIP
Neural Diving (parallel)
Neural Diving (sequential)
102
103
Calibrated time (seconds)
0.0
0.1
0.2
0.3
0.4
0.5
Average primal gap
NN Verification
102
103
104
Calibrated time (seconds)
0.0
0.1
0.2
0.3
0.4
0.5
Average primal gap
Google Production Packing
102
103
104
Calibrated time (seconds)
0.0
0.1
0.2
0.3
0.4
0.5
Average primal gap
Google Production Planning
102
103
104
Calibrated time (seconds)
0.0
0.1
0.2
0.3
0.4
0.5
Average primal gap
Electric Grid Optimization
102
103
104
Calibrated time (seconds)
0.0
0.1
0.2
0.3
0.4
0.5
Average primal gap
MIPLIB
Figure 6
Average primal gap between the primal bound and the best known objective value achieved by various
algorithms as a function of running time on the benchmark datasets.
6.2.1. Generating partial assignments As described above, to produce partial assignments
to an input MIP, we use the output of our generative models to decide both which values each
integer variable should take, as well as which variables to assign: The acceptance head decides which
variables are fixed or tightened, while the prediction head provides the variable values. Algorithm 1
describes this procedure.
We can repeat the above procedure many times, by obtaining multiple samples from a model,
and also by using multiple models trained with different choices for the coverage threshold for the
acceptance decision, to produce a diverse set of sub-MIPs. In experiments with the conditionally
independent model we use one sample per model, and generate different sub-MIPs by using models
trained with different coverage thresholds for y.

Page 20
20
102
103
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with primal gap <= 0.0
CORLAT
Tuned SCIP
Neural Diving (parallel)
Neural Diving (sequential)
102
103
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with primal gap <= 0.05
NN Verification
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with primal gap <= 0.01
Google Production Packing
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with primal gap <= 0.03
Google Production Planning
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with primal gap <= 0.0001
Electric Grid Optimization
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with primal gap <= 0.0
MIPLIB
Figure 7
Fraction of instances with primal gap less than or equal to the target gap achieved by various algorithms
as a function of running time on the benchmark datasets.
101
102
103
Real time (seconds)
0.0
0.1
0.2
0.3
0.4
0.5
Average primal gap
NN Verification
Tuned SCIP
Gurobi
Neural Diving + SCIP (parallel)
Neural Diving + Gurobi (parallel)
101
102
103
104
Real time (seconds)
0.0
0.1
0.2
0.3
0.4
0.5
Average primal gap
Google Production Packing
Figure 8
Average primal gap between the primal bound and the best known objective value achieved by various
algorithms as a function of running time on the benchmark datasets.

Page 21
21
101
102
103
Real time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with average primal gap <= 0.05
NN Verification
Tuned SCIP
Gurobi
Neural Diving + SCIP (parallel)
Neural Diving + Gurobi (parallel)
101
102
103
104
Real time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with average primal gap <= 0.01
Google Production Packing
Figure 9
Fraction of instances with primal gap less than or equal to the target gap achieved by various algorithms
as a function of running time on the benchmark datasets.
6.2.2. Sequential and parallel sub-MIPs solving The sub-MIPs that we generate in the
previous step can be solved fully independently from each other, allowing us to search for solutions
for all sub-MIPs in parallel. In this parallel setting, we generate up to 100 sub-MIPs by applying
combinations of different models and SCIP random seeds, and distribute each sub-MIP onto their
own dedicated machine. Since each sub-MIP in this setting is solved in parallel, we report results
in terms of (calibrated) wall clock time, meaning that we take results from the best performing
sub-MIP at any given time.
While the parallel setting showcases the advantage that our approach has when distributed
compute resources are available, we also set up a comparison that controls for the amount of
computational resources used. That is, in addition to the parallel setting, we also compare our
approach to SCIP when both are run sequentially on a single machine. In this setting, we are still
generating multiple sub-MIPs as before, but process them sequentially in a random order, instead
of in parallel.
6.3. Results
We trained a GNN for every dataset on the training set and tuned hyperparameters (number of
layers, width of layers, learning rate, exponential decay steps and assignment coverage thresholds)
on the validation set. We report results on average primal gap vs our baseline SCIP in Figure
6. Our parallel and sequential runs overall produce better primal bounds in shorter time on all
datasets, compared to SCIP. We believe that the strength of our approach is in quickly finding
good solutions, but it sometimes fails to find an optimal or near-optimal solution. This can can be
seen e.g. on the survival plots in figure 7 where Neural Diving approach wins at shorter time limits,
but loses to SCIP at the end on the Electric Grid Optimization and MIPLIB datasets. Note that
SCIP’s internal definition of gap is different from the one used in this paper, so in all experiments
we set the relative gap setting to 0 in order to avoid stopping prematurely.

Page 22
22
Algorithm 1: Generating variable assignments and tightenings
Input: Learned distributions pθ(x|M), pθ(y|M), MIP instance M
Output: Variable assignment and bound tightenings
assignment := {}
tightenings := {}
for xi ∈ Variables(M) do
if xi is binary variable then
Sample px from pθ(xi|M)
Sample y from pθ(yi|M)
if y = 1 then
Add (xi := round(px)) to assignment
if xi is non-binary integer variable then
lb := lower bound of xi
ub := upper bound of xi
b0, ..., bk := binary representation of (ub−lb), b0 being most significant bit
for j ∈ {0, ..., k} do
Sample px from pθ(xi,j|M)
Sample y from pθ(yi,j|M)
if y = 1 then
if round(px)=1 then
lb := lb+⌈(ub−lb)/2⌉
else
ub := lb+⌊(ub−lb)/2⌋
else
Add (lb ≤ xi ≤ ub) to tightenings
break
return assignment,tightenings
We include additional results on two datasets (Google Production Packing and NN Verification,
see figures 8 and 9) where we combine Neural Diving with the Gurobi solver (Gurobi Optimization
2020): we assign variables in the same way, but use Gurobi instead of SCIP to solve the remaining
problem. The reported results are in non-calibrated time as we do not have access to the Gurobi
internals and cannot report the calibrated time during solving.
6.3.1. Finding incumbents to open MIPLIB instances In addition to evaluating Neural
Diving on the datasets from Table 1, we also applied a modification of our approach on solving open
instances from the MIPLIB 2017 Collection Set. These instances are classified as open, because
no optimal solution is known: Either there is no feasible solution known; or there exists a known

Page 23
23
Table 2
Incumbents found for open MIPLIB 2017 instances by Neural Diving. Lower objective value is better.
Instance
Neural Diving Previous best
objective value objective value
milo-v12-6-r1-75-1 1153756.398
1153880
neos-1420790
3121.29
3121.42
xmas10-2
-497
-495
feasible solution (incumbent), but it does not match any known dual bound. In this sense, these
instances are the hardest problems that the MIPLIB 2017 Collection Set offers.
Apart from their inherent hardness, these instances also impose an additional challenge to our
learning-based approach: The set of instances is almost entirely heterogeneous, in that there are
only a handful of related instances available per problem, if at all. Our approach of learning a
domain-specific primal heuristic hence needed to be adapted. To tackle this challenge, we built a
training set for each attempted open instance which should represent the problem distribution of
the target instance. More specifically, for a given target instance, we generated up to 500 similar
instances through randomly applying a mix of the following manipulations: (1) Randomly dropping
constraints of the target instance, and (2) fixing a random subset of variables to the previous
incumbent. To generate each such instance, we applied 10 iterations of each step (1) and (2) on
the original target instance.
After creating such a training set for each target instance, we apply our approach as described
in earlier sections: We trained a separate GNN for each target instance, using the generated train-
ing set, with label solutions found on the training set using SCIP. After training, we used the
corresponding GNN to produce up to 10 partial assignments to the target instance (as described
in Section 6.2.1), and used Gurobi to solve the sub-MIPs defined by the partial assignments. For
three open instances, this approach yielded new incumbents that were not known before. Table 2
shows the improved objective values compared the previous best known solution.
7. Neural Branching
A branch-and-bound procedure has two decisions to make every iteration: which leaf node to
expand, and which variable to branch on. In this work we focus on the latter. The quality of the
variable selection decision can have a large impact on the number of steps taken by branch-and-
bound to solve a MIP (Achterberg et al. 2005, Glankwamdee and Linderoth 2011, Schubert 2017,
Yang et al. 2019). We use a deep neural network to learn a variable selection policy by imitating
the actions of a node-efficient, but computationally expensive, expert. By distilling the policy of
the expensive expert into the neural network we seek to maintain approximately the same decision

Page 24
24
quality but substantially reduce the time taken to make the decision. The decision at a given tree
node is entirely local to that node, so a learned policy only needs to have a representation of the
node as input, rather than the entire tree, which makes it more scalable.
7.1. Expert policy
We would like an expert policy that solves MIPs by building small branch-and-bound trees, since
such an expert is making good branching decisions. Among the many branching policies proposed
in the optimization literature (see, e.g., Achterberg et al. (2005)) none provably achieve the smallest
trees, but empirically full strong branching (FSB) (Achterberg et al. 2005) tends to use fewer steps
than competing approaches. It performs one-step lookahead search by simulating one branching
step for all candidate variables and picking the one that provides the largest improvement in the
dual bound, as determined by the solution to the linear program at the new node. (A generalization
to k-step lookahead search can be found in Glankwamdee and Linderoth (2011), Schubert (2017).)
This requires solving 2 × ncands linear programs, where ncands is the number of possible branching
candidates at a given step of branch-and-bound. Concretely, denote by xthe solution to the LP
relaxation of (1) at the current node we are branching from with variable constraint vectors l and
u, i.e., l ≤ x≤ u. Then for each variable candidate i, FSB needs to solve the following two LPs:
minimize c xup
subject to Axup ≤ b
l(i) ≤ xup ≤ u
minimize c xdown
subject to Axdown ≤ b
l ≤ xdown ≤ u(i)
(18)
over variables xup ∈ Rn and xdown ∈ Rn, where l
(i)
i
= ⌈x
i ⌉ and u(i)
i
= ⌊x
i ⌋, and other entries of l(i)
and u(i) are unchanged from their values at the current node, l and u. It combines the optimal
values of these LPs into a score for that candidate, and it uses those scores to decide which variable
to branch on.
For practical MIP solving FSB’s per-step computational cost is often so high that it can be
used only for a few branch-and-bound steps before the running time becomes prohibitively high.
SCIP’s default variable selection policy, reliable pseudocost branching (RPB) (Achterberg et al.
2005), uses FSB for a small number of steps at the start of branch-and-bound and switches to a
heuristic with lower per-step cost for the subsequent steps. But an expert policy used as the target
for imitation learning can afford a higher per-step cost. The expert is only used to generate training
data offline as a one-time expense by running it on a training set of MIPs and recording its inputs
and outputs at each step. Even then, on large-scale instances (e.g., with > 105 variables), offline
data generation can be prohibitively slow, which prevents learning approaches from scaling to such
instances. Therefore we speed up the expert policy by exploiting GPUs, as explained next.

Page 25
25
9000
9200
9400
9600
9800
10000
Number of variables at a node
0
250
500
750
1000
1250
Number of nodes
Google Production Packing
Vanilla FSB
ADMM
1
2
3
4
5
6
7
8
Number of variables at a node (x 105)
0
500
1000
1500
2000
Number of nodes
MIPLIB
Vanilla FSB
ADMM
Figure 10
Histogram of the number of imitation learning examples, where each example is a node in a branch-
and-bound search tree, generated by the ADMM expert and the Vanilla Full Strong Branching expert
on the training sets for Google Production Packing and MIPLIB in the same time limit.
7.1.1. ADMM batch LP solving SCIP by default uses the LP solver SOPLEX (Gamrath
et al. 2020) based on the simplex algorithm (Dantzig 1998) for solving the linear programs aris-
ing from FSB. Although in principle the simplex algorithm can be parallelized, it is not easy to
batch the computation together for several LPs that are closely related (and therefore leverage a
GPU), and consequently SOPLEX simply solves the LPs sequentially on CPU. In order to scale
to a large number of candidate variables at a reasonable cost we wrote our own batch LP solver
using the alternating directions method of multipliers (ADMM) (Boyd et al. 2011), based on the
Splitting Conic Solver (SCS) algorithm for cone programs (O’Donoghue et al. 2016, O’Donoghue
2020). This solver can simultaneously handle many LPs in a batch that we process on GPU. This
can (approximately) solve all the LPs in full strong branching substantially faster than running
SOPLEX sequentially. Moreover, since we are using this solver merely to produce training data, we
can tune the algorithm parameters with the primary goal of producing smaller search trees, even
at the cost of some data generation time. It is the output produced by this ADMM-based policy
that we will train a graph neural network to imitate. The graph neural network is able to produce
actions that are close enough to the policy of the ADMM expert to produce small search trees,
but at a fraction of the computational cost. We provide full details in the Appendix, section 12.5.
To demonstrate the improved scalability of the expert data generation, we compare the ADMM
expert to the expert used by Gasse et al. (2019) called Vanilla Full Strong Branching (VFSB) on
Google Production Packing and MIPLIB. VFSB is a modified version of SCIP’s FSB implementa-
tion that is still performed sequentially but disables its ‘side effects’, i.e., changes to the solver state
not directly due to a branching decision itself, but due to information gained during the lookahead
search over branching candidates. Figure 10 shows a histogram of the number of variables in each
branch-and-bound node generated by the two experts within 60 hours, on nodes with the largest
number of variables for both datasets (> 9000 variables for Google Production Packing and > 105

Page 26
26
variables for MIPLIB). The ADMM expert generates 31.7× and 1.6× more data on nodes with
the largest number of variables for Google Production Packing and MIPLIB, respectively. Across
the full range of the number of variables, the ADMM expert generates 12.1× and 1.4× more data
on Google Production Packing and MIPLIB, respectively. The largest node that the VFSB expert
manages to generate data from on MIPLIB has 3.4×105 variables, while the ADMM expert is able
to generate data from nodes with almost 106 variables.
7.2. Imitation Learning
Imitation learning is the name given to the broad family of algorithms that seek to learn the
policy of an expert, given examples of expert behavior. In most cases, and in this paper, this is
formulated as a supervised learning problem. We consider three variants: 1) cloning an expert
policy (Pomerleau 1989, Bain and Sammut 1995) 2) distillation with random moves, and 3) DAgger
(Ross et al. 2011). Distillation is the simplest of the three and it simply attempts to learn to
predict the outputs of the expert at each node. Distillation is not robust to shifts away from the
expert’s state distribution. This is particularly problematic in the case of branching as mistakes
near the root can cause the subsequent nodes visited in the search tree to be very different than
those seen during training, since the expert never visited those nodes. Mixing random moves with
expert moves is a simple heuristic that can alleviate this problem to some degree. When running
the expert we take a random action with probability 10% at each node. We can run the expert
many times on each training MIP generating slightly different data each time, thereby increasing
the total amount of training data available. In all our experiments we did this five times for each
MIP. DAgger is another, more involved, procedure where we train an agent using distillation, run
it in a branch-and-bound procedure to make decisions, but also compute the expert outputs at
each step to use as targets for learning. This produces one step of DAgger data upon which we can
re-train the agent. In theory this procedure can be repeated using the new agent trained on the
previous agent DAgger data, but in this manuscript we only consider a single iteration of DAgger.
In our experiments, we treat the choice of the three imitation learning variants as a hyperpa-
rameter to tune for each dataset. We generate data and train policies using all three variants for
each dataset, and we select the policy that achieves the lowest average dual gap on the validation
set instances in a 3 hour time limit. The selected policy is then evaluated on the test set to report
results.
7.3. Implementation details
The ADMM expert solves a pair of LPs for each branching candidate (the ‘up’ and ‘down’ branches).
Let us denote by OPTup
i
and OPTdown
i
the (estimated) optimal values of the LP for the up and

Page 27
27
down branches respectively for candidate i, and denote by OPT the objective value of the LP at
the node we are branching from. We combine these numbers into a single score for each variable
defined as
si = (OPTup
i
−OPT +ϵ)(OPTdown
i
−OPT +ϵ)
(19)
where ϵ = 10−4 (this is essentially the product rule from Achterberg (2009)). Given a set C of
candidate variables for branching, we convert their scores into a categorical distribution over the
candidates using
pexpert
i
=
si
c∈C sc
.
On the training set of MIPs we ran the ADMM expert, with random moves or initialized using
DAgger as appropriate, and logged the expert scores taken at each node (we did not log the action
if it was selected randomly or selected by the agent in the DAgger setting - only the ADMM expert
action was ever logged). This provides our training dataset for the imitation learning agent.
The neural network policy for variable selection has the same graph convolutional network archi-
tecture described in section 3.2. It conditions on a leaf node of the branch-and-bound tree selected
by SCIP to branch on. As the node itself is a MIP, it can be represented using the bipartite graph
representation described in section 3.1. Its output is a categorical distribution over the set of can-
didates C. Let M be the MIP bipartite graph representation of the leaf node, vc be the embedding
computed by a graph convolutional network for the bipartite graph node for the branching candi-
date variable xc, as described in section 3.2, and φ be the learnable parameters of the policy. The
probability pφ(xc|M) of selecting xc for branching at the node is given by:
tc = MLP(vc;φ),
(20)
pφ(xc|M) =
exp(−tc)
c ∈C exp(−tc )
.
(21)
When training, the neural network receives a batch of node features with the associated expert
scores and produces a set of probabilities pgraphnet for each node in the batch by passing the
activations of the last layer through a softmax. We used batch size 8 sampled uniformly at random
from the training dataset. We tried a variety of losses and found that negative cross-entropy loss
performed best on average, which is given by
L(φ) =
c∈C
pexpert
c
logpφ(xc|M).
The loss is minimized using ADAM (Kingma and Ba 2014) with a learning rate of 10−4. Figure 16 in
the appendix shows that imitation learning succeeds to accurately approximate the expert policy,
both on the training and test sets.

Page 28
28
7.4. Results
We evaluate the learned branching policy on the task of optimizing the dual bound. As explained
in section 5, for comparison we use the dual gap γd(t) with respect to a pre-computed best known
primal bound pas a function of time. The gap is averaged over all MIPs in the test set of a
dataset.
Figure 11 shows the average dual gap curves for Neural Branching and Tuned SCIP. We also
compare to SCIP’s Full Strong branching policy (run with all SCIP parameters set to default,
except the branching policy) as a node-efficient baseline. Neural Branching achieves significantly
smaller average dual gap for the same running time on four out of six datasets. (Note the log
scale of the y-axis.) MIPLIB and Google Production Planning are the only datasets where Neural
Branching is not the best, but it is still comparable to Tuned SCIP. The improvements given by
Neural Branching can be substantial – e.g., its average dual gap is about 7× better on CORLAT,
20× better on Neural Network Verification, and 2× better on Electric Grid Optimization, compared
to Tuned SCIP at large time limits. On Google Production Packing, Neural Branching is better
at intermediate times (e.g., about 1.5× better at 103 seconds), with Tuned SCIP catching up
eventually.
Figure 12 shows survival plots computed by applying a dataset’s target optimality gap to the
dual gap of each test set MIP instance. They confirm the conclusions from figure 11 – Neural
Branching solves a higher fraction of test instances consistently across all time limits on all datasets
except Google Production Planning and MIPLIB.

Page 29
29
101
102
103
Calibrated time (seconds)
10 4
10 2
100
Average Dual Gap
CORLAT
Full Strong
Neural Branching
Tuned SCIP
101
102
103
104
Calibrated time (seconds)
10 3
10 2
10 1
100
Average Dual Gap
NN Verification
101
102
103
104
Calibrated time (seconds)
10 1
100
Average Dual Gap
Google Production Packing
101
102
103
104
Calibrated time (seconds)
10 1
100
Average Dual Gap
Google Production Planning
101
102
103
104
Calibrated time (seconds)
10 3
10 2
10 1
100
Average Dual Gap
Electric Grid Optimization
101
102
103
104
Calibrated time (seconds)
10 1
100
Average Dual Gap
MIPLIB
Figure 11
Average dual gap with respect to a pre-computed best known primal bound as a function of running
time on the test set of benchmark datasets.

Page 30
30
101
102
103
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems with dual gap <= 0.0
CORLAT
Full Strong
Neural Branching
Tuned SCIP
101
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with dual gap <= 0.05
NN Verification
101
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with dual gap <= 0.01
Google Production Packing
101
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with dual gap <= 0.03
Google Production Planning
101
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with dual gap <= 0.0001
Electric Grid Optimization
101
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems with dual gap <= 0.0
MIPLIB
Figure 12
Survival plots computed by applying a dataset’s target optimality gap to the dual gap of each test set
MIP instance.
8. Joint Evaluation
We now combine Neural Branching and Neural Diving into a single solver. As we shall see, this
results in significant speedups over Tuned SCIP. We use the learned heuristics by integrating them
into SCIP via the interface provided by PySCIPOpt (Maher et al. 2016). We consider all four
possible ways of combining Neural Branching and Neural Diving with SCIP: 1) Tuned SCIP alone,
2) Neural Branching + Neural Diving (Sequential) uses both neural heuristics, 3) Neural Branching
uses only the learned branching policy, and 4) Tuned SCIP + Neural Diving (Sequential) uses
only the sequential version of Neural Diving. When Neural Diving (Sequential) is used as a primal
heuristic, we disable all built-in primal heuristics of SCIP.
To make a comparison based on Neural Diving fairer in terms of compute resources, we only
consider its sequential version here. For a single MIP solve, Neural Diving (Sequential) and Neural

Page 31
31
Branching each use a CPU core and a GPU. So Neural Branching + Neural Diving (Sequential)
uses two cores and two GPUs. We also give Tuned SCIP two cores, which it uses by running two
independent SCIP solves on the input MIP with different random seeds. The primal, dual, and
primal-dual gaps for Tuned SCIP at a given time are then computed by selecting the best primal
and dual bounds across the two runs at that time. Since SCIP does not use GPUs, we do not
control for that resource when comparing Tuned SCIP and Neural Solvers.
For each expert we solve each MIP in the test dataset five times, with five different seeds for
SCIP. We present the primal-dual gap γpd(t) (see section 5) averaged over each dataset’s test MIPs
as a function of calibrated time. We also present survival plots that show the fraction of test MIPs
solved within the required dataset-specific optimality gap as a function of calibrated time.
8.1. Results
Figure 13 shows the average primal-dual gap curves as a function of running time. A Neural Solver
significantly outperforms Tuned SCIP on four of the datasets. (Note the log scale of the y-axis.)
The average primal-dual gap is more than five orders of magnitude better than Tuned SCIP on
Neural Network Verification at large time limits. On Google Production Packing, Neural Branching
+ Neural Diving (Sequential) achieves low gaps much more quickly, reaching a 0.1 gap in more than
5× less time, but Tuned SCIP eventually catches up. On Electric Grid Optimization, it achieves
a 2× lower gap at higher running times. On MIPLIB, Tuned SCIP + Neural Diving (Sequential)
achieves a 1.5× better gap at higher running times.
Figure 14 shows the survival plots as a function of running time. They further confirm the
observations from figure 13, with a Neural Solver solving a higher fraction of test set problems
at a given time limit than Tuned SCIP on four of the datasets. Together, the results show that
learning can be used to significantly improve a strong solver such as SCIP. The improvements are
seen across different problem scales and applications. Surprisingly, learning improves performance
even on MIPLIB, a heterogeneous dataset combining many applications that a priori would not
be considered similar enough to have sufficient shared structure for learning to exploit. That the
improvements are observed across diverse settings confirms the broad usefulness of learning.
A further comparison with respect to the Penalized Average Runtime metric can be found in
Appendix section 12.4. The overall conclusions remain the same.
9. Related Work
Learning and MIP Primal Heuristics: Learning has previously been used to switch among an
ensemble of existing primal heuristics during a branch-and-bound run. Khalil et al. (2017b) learn a
binary classifier offline to predict whether applying a primal heuristic at a branch-and-bound node

Page 32
32
101
102
103
Calibrated time (seconds)
10 4
10 2
100
Average Primal-Dual Gap
CORLAT
Full Strong
Neural Branching
Neural Branching + Neural Diving (Sequential)
Tuned SCIP
Tuned SCIP + Neural Diving (Sequential)
101
102
103
104
Calibrated time (seconds)
10 6
10 4
10 2
100
Average Primal-Dual Gap
NN Verification
101
102
103
104
Calibrated time (seconds)
10 1
100
Average Primal-Dual Gap
Google Production Packing
101
102
103
104
Calibrated time (seconds)
10 1
100
Average Primal-Dual Gap
Google Production Planning
101
102
103
104
Calibrated time (seconds)
10 3
10 2
10 1
100
Average Primal-Dual Gap
Electric Grid Optimization
101
102
103
104
105
Calibrated time (seconds)
10 1
100
Average Primal-Dual Gap
MIPLIB
Figure 13
Primal-dual gap achieved by various combinations of primal heuristics and variable selection policies
as a function of running time on the benchmark datasets. The gap is computed for each MIP in the
test set of each dataset and then aggregated using the average.
will succeed in finding an incumbent. Hendel (2018) formulate a multi-armed bandit approach to
learn a switching policy online. Neural Diving constructs a new primal heuristic specifically for a
given application. This can be particularly powerful if none of the existing heuristics are well-suited
for the application.
Several works consider learning neighborhood search policies (Addanki et al. 2020, Hottung
and Tierney 2019, Song et al. 2020). Given a feasible point, a learned policy is used to modify
the assignment for a subset of variables such that the resulting feasible assignment has a lower
objective. In these works the learned policy is used to either select the variables to be modified, or
to assign new values to an already selected subset of variables. They require a feasible assignment
as input, while Neural Diving does not.

Page 33
33
10 4
10 2
100
102
Calibrated time (seconds)
0.00
0.25
0.50
0.75
1.00
Fraction of problems solved
with gap <= 0.0
CORLAT
Full Strong
Neural Branching
Neural Branching + Neural Diving (Sequential)
Tuned SCIP
Tuned SCIP + Neural Diving (Sequential)
10 1
100
101
102
103
104
Calibrated time (seconds)
0.00
0.25
0.50
0.75
1.00
Fraction of problems solved
with gap <= 0.05
NN Verification
10 1
100
101
102
103
104
Calibrated time (seconds)
0.0
0.1
0.2
0.3
Fraction of problems solved
with gap <= 0.01
Google Production Packing
10 4
10 2
100
102
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
Fraction of problems solved
with gap <= 0.03
Google Production Planning
102
103
104
Calibrated time (seconds)
0.00
0.25
0.50
0.75
1.00
Fraction of problems solved
with gap <= 0.0001
Electric Grid Optimization
10 1
100
101
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
Fraction of problems solved
with gap <= 0.0
MIPLIB
Figure 14
Survival plots with respect to calibrated running time for various combinations of primal heuristics
and variable selection policies as a function of running time on the benchmark datasets.
Learning to predict variable values in a MIP and combining the predictions with a solver has
been studied before. Ding et al. (2020) trains a model that takes a MIP as input and predicts
values for a subset of its binary variables. This is then used to define an additional constraint
that any solution be within a pre-specified Hamming distance from the predicted values, which
can significantly reduce the search space and speed up a MIP solver. Training labels are generated
by iteratively improving a feasible assignment and selecting only the subset of binary variables
whose values are stable across iterations as prediction targets. Results are presented on several
synthetic datasets. Xavier et al. (2020) learns to warm start a MIP solver specifically for the electric
grid unit commitment problem. They use a k-Nearest Neighbor binary classifier to predict values
for a subset of binary variables in the MIP. Assuming a fixed grid topology and a fixed set of
power plants allows them to use a MIP representation with constant dimensionality and a pre-

Page 34
34
defined distance metric to identify neighbors. Our work differs significantly from both of these: 1)
the generative modelling formulation provides a principled approach to capturing the uncertainty
in variable values and exploiting it to generate several sub-problems to find high-quality feasible
assignments, 2) it handles general MIPs (varying sizes, non-binary integers), and 3) we present
results on diverse real-world applications.
Learning and MIP Branching Heuristics: Several works have used imitation learning to train
a branching policy to imitate (a variant of) Full Strong Branching. Khalil et al. (2016) learns to
imitate Full Strong Branching for a single instance, using data collected as it is being solved. They
apply a learning-to-rank formulation and handcrafted features. Similarly, Alvarez et al. (2017)
learns a regression model to predict the scores computed by Full Strong Branching for candidate
variables, again using handcrafted features. Neither of these works report current state-of-the-
art results for the learned policies. Neural Branching builds on the work by Gasse et al. (2019),
which trains a Graph Convolutional Network to imitate decisions by Full Strong Branching. The
main differences are 1) our use of the ADMM-based expert to scale up Full Strong Branching to
large instances, and 2) extensive evaluation on real-world applications and significant performance
improvements over SCIP.
More recently, Zarpellon et al. (2020) show that using a representation of the branch-and-bound
tree (rather than that of just a single node) can improve the generalization of a learned branching
policy. Gupta et al. (2020) obtain computational improvements similar to Gasse et al. (2019) in
the setting of CPU-restricted machines by using an expensive graph network only at the root node
of the branch-and-bound tree and a cheaper MLP elsewhere.
Other Learning Approaches for MIP Solvers: Learning has been used to automatically tune
parameters for MIP solvers. Algorithm Configuration (Ansótegui et al. 2009, Hutter et al. 2009,
2011, Ansótegui et al. 2015) aims to find parameter values that improve aggregate performance on
a dataset of instances. For example, Hutter et al. (2011) proposes Sequential Model-based Algo-
rithm Configuration (SMAC), which iterates over three main steps: 1) evaluate the performance of
selected solver parameter values on a set of instances, 2) learn a model to predict the performance
of given parameter values from the evaluations done so far, and 3) use the model learned so far
to select new candidate parameter values to evaluate. Results on CPLEX show that SMAC finds
parameter values that are better than the default and the values found by CPLEX’s tuning tool.
Algorithm Selection (Kotthoff 2016) aims to improve solver parameter values for a specific instance.
For example, Hutter et al. (2014) shows that it is possible to accurately predict the running time of
MIP solvers as a function of the solver parameters and the instance being solved. Such a predictor
can be used to improve the solver parameters for a specific instance. Note that learning techniques

Page 35
35
for Algorithm Configuration and Algorithm Selection are complementary to and can be combined
with the learning of application-specific heuristics.
Learning for Combinatorial Optimization: Our work is an instance of the broader topic of
learning to solve combinatorial optimization problems. Some of the earliest works in this area are
Zhang and Dietterich (1995), Moll et al. (1999), Boyan and Moore (1997). More recently, deep
learning has been applied to the Travelling Salesman Problem (Vinyals et al. 2015, Bello et al.
2016), Vehicle Routing (Kool et al. 2019, Nazari et al. 2018), Boolean Satisfiability (Selsam et al.
2019), and general graph-structured combinatorial optimization problems (Khalil et al. 2017a, Li
et al. 2018). A survey of the topic is available by Bengio et al. (2018).
10. Conclusion
This work has demonstrated the long-held promise of machine learning to significantly improve MIP
solver performance on both large-scale real-world application datasets and MIPLIB. We believe
even bigger improvements are possible with further advances in models and algorithms.
Some promising future directions are:
• Learning to cut: Better selection and generation of cuts using learning is another potential
source of performance improvements. Tang et al. (2020) provide evidence for the usefulness of this
direction.
• Warm-starting models: The strong performance of learned models on MIPLIB suggests that
it is possible to learn heuristics that work well across diverse MIPs. This can be used to overcome
the ‘cold-start’ problem in applications where the amount of training data available early on in
an application’s life cycle may be too small to train good models. We can start by using models
trained on heterogeneous datasets and use them as a bridge to more specialized models as more
data is collected for the application.
• Reinforcement learning: Performance achieved using distillation or behavioral cloning is capped
by the best expert available, while reinforcement learning (RL) can potentially exceed it. Effi-
cient exploration, long range credit assignment, and computational scalability of learning are key
challenges in applying RL to large-scale MIPs. Addressing them can lead to bigger performance
improvements.
11. Acknowledgments
The authors would like to thank
• Alek Andreev, Matko Bosnjak, Elena Buchatskaya, Andrew Cowie, Vali Irimia, Vladimir
Macko, Sona Mokra, Rahul Palamuttam, Alvaro Sanchez, and Kevin Waugh for their help in
implementing various neural network models,

Page 36
36
• David Applegate for advice on Neural Diving and providing the Google Production Packing
dataset,
• Dave Helstroom for help with the Google Production Planning dataset,
• Jakob Bauer, Michael King, Andrea Michi, Richard Tanburn, and Sims Witherspoon for pro-
viding the Electric Grid Optimization dataset,
• Dj Dvijotham and Sven Gowal for providing the Neural Network Verification dataset, and
• Theophane Weber for proofreading the paper and suggesting improvements,
• Miles Lubin for providing help with PySCIPOpt and references for Algorithm Configuration
and Algorithm Selection, and
• Ross Anderson and Juan Pablo Vielma for suggesting improvements to the paper.

Page 37
37
References
Achterberg T (2009) SCIP: solving constraint integer programs. Mathematical Programming Computation
1(1):1–41.
Achterberg T, Koch T, Martin A (2005) Branching rules revisited. Operations Research Letters 33(1):42 –
54.
Achterberg T, Koch T, Martin A (2006) Miplib 2003. Operations Research Letters 34(4):361–372.
Addanki R, Nair V, Alizadeh M (2020) Neural large neighborhood search. Learning Meets Combinatorial
Algorithms NeurIPS Workshop.
Alvarez A, Louveaux Q, Wehenkel L (2017) A machine learning-based approximation of strong branching.
INFORMS Journal on Computing 29:185–195, URL http://dx.doi.org/10.1287/ijoc.2016.0723.
Ansótegui C, Malitsky Y, Samulowitz H, Sellmann M, Tierney K (2015) Model-based genetic algorithms
for algorithm configuration. Proceedings of the 24th International Conference on Artificial Intelligence,
733–739, IJCAI’15 (AAAI Press), ISBN 9781577357384.
Ansótegui C, Sellmann M, Tierney K (2009) A gender-based genetic algorithm for the automatic config-
uration of algorithms. Gent IP, ed., Principles and Practice of Constraint Programming - CP 2009,
142–157 (Berlin, Heidelberg: Springer Berlin Heidelberg).
Bain M, Sammut C (1995) A framework for behavioural cloning. Machine Intelligence 15, 103–129.
Balcan MF, Dick T, Sandholm T, Vitercik E (2018) Learning to branch. volume 80 of Proceedings of
Machine Learning Research, 344–353 (Stockholmsmässan, Stockholm Sweden: PMLR), URL http:
//proceedings.mlr.press/v80/balcan18a.html.
Battaglia PW, Hamrick JB, Bapst V, Sanchez-Gonzalez A, Zambaldi VF, Malinowski M, Tacchetti A,
Raposo D, Santoro A, Faulkner R, Gülçehre C¸, Song F, Ballard AJ, Gilmer J, Dahl GE, Vaswani
A, Allen K, Nash C, Langston V, Dyer C, Heess N, Wierstra D, Kohli P, Botvinick M, Vinyals O,
Li Y, Pascanu R (2018) Relational inductive biases, deep learning, and graph networks. CoRR URL
http://arxiv.org/abs/1806.01261.
Bello I, Pham H, Le QV, Norouzi M, Bengio S (2016) Neural combinatorial optimization with reinforcement
learning. arXiv preprint arXiv:1611.09940 .
Bengio Y, Bengio S (2000) Modeling high-dimensional discrete data with multi-layer neural net-
works. Solla S, Leen T, Müller K, eds., Advances in Neural Information Processing Systems,
volume 12, 400–406 (MIT Press), URL https://proceedings.neurips.cc/paper/1999/file/
e6384711491713d29bc63fc5eeb5ba4f-Paper.pdf.
Bengio Y, Lodi A, Prouvost A (2018) Machine learning for combinatorial optimization: a methodological
tour d’horizon.
Berthold T (2006) Primal heuristics for mixed integer programs. URL https://opus4.kobv.de/opus4-zib/
files/1029/Berthold_Primal_Heuristics_For_Mixed_Integer_Programs.pdf.

Page 38
38
Berthold T (2007) Rens-relaxation enforced neighborhood search .
Berthold T (2013) Measuring the impact of primal heuristics. Operations Research Letters 41(6):611–614.
Bishop CM (2006) Pattern Recognition and Machine Learning (Information Science and Statistics)
(Berlin, Heidelberg: Springer-Verlag), ISBN 0387310738, URL https://www.microsoft.com/en-us/
research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.
pdf.
Boyan JA, Moore AW (1997) Using prediction to improve combinatorial optimization search. In Proc. of 6th
Int’l Workshop on Artificial Intelligence and Statistics.
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning
via the alternating direction method of multipliers. Foundations and Trends® in Machine learning
3(1):1–122.
Boyd S, Vandenberghe L (2004) Convex optimization (Cambridge university press).
Cheng CH, Nührenberg G, Ruess H (2017) Maximum resilience of artificial neural networks. D’Souza D,
Narayan Kumar K, eds., Automated Technology for Verification and Analysis, 251–268 (Springer Inter-
national Publishing).
Conrad J, Gomes CP, van Hoeve WJ, Sabharwal A, Suter J (2007) Connections in networks: Hardness of
feasibility versus optimality. Van Hentenryck P, Wolsey L, eds., Integration of AI and OR Techniques in
Constraint Programming for Combinatorial Optimization Problems, 16–28 (Berlin, Heidelberg: Springer
Berlin Heidelberg).
Dantzig GB (1998) Linear programming and extensions, volume 48 (Princeton university press).
Diamond S, Boyd S (2016) Cvxpy: A python-embedded modeling language for convex optimization. The
Journal of Machine Learning Research 17(1):2909–2913.
Ding J, Zhang C, Shen L, Li S, Wang B, Xu Y, Song L (2020) Accelerating primal solution findings for
mixed integer programs based on solution prediction. AAAI, URL https://www.aaai.org/Papers/
AAAI/2020GB/AAAI-DingJ.6745.pdf.
Domahidi A, Chu E, Boyd S (2013) ECOS: An SOCP solver for embedded systems. European Control
Conference (ECC), 3071–3076.
Eckstein J, Bertsekas DP (1992) On the Douglas—Rachford splitting method and the proximal point algo-
rithm for maximal monotone operators. Mathematical Programming 55(1-3):293–318.
Eckstein J, Nediak M (2007) Pivot, cut, and dive: A heuristic for 0-1 mixed integer programming. Journal
of Heuristics 13(5):471–503.
FICO Xpress (2020) FICO Xpress optimization suite. URL https://www.fico.com/en/products/
fico-xpress-optimization.
Gamrath G, Anderson D, Bestuzheva K, Chen WK, Eifler L, Gasse M, Gemander P, Gleixner A, Gottwald
L, Halbig K, et al. (2020) The SCIP optimization suite 7.0 .

Page 39
39
Gasse M, Chételat D, Ferroni N, Charlin L, Lodi A (2019) Exact combinatorial optimization with graph
convolutional neural networks. Advances in Neural Information Processing Systems, 15554–15566.
Geifman Y, El-Yaniv R (2019) Selectivenet: A deep neural network with an integrated reject option. arXiv
preprint arXiv:1901.09192 .
Glankwamdee W, Linderoth JT (2011) Lookahead branching for mixed integer programming. ICS 2011.
Gleixner A, Hendel G, Gamrath G, Achterberg T, Bastubbe M, Berthold T, Christophel PM, Jarck K, Koch
T, Linderoth J, Lübbecke M, Mittelmann HD, Ozyurt D, Ralphs TK, Salvagnin D, Shinano Y (2019)
MIPLIB 2017: Data-Driven Compilation of the 6th Mixed-Integer Programming Library. Technical
report, Optimization Online, URL http://www.optimization-online.org/DB_FILE/2019/07/7285.
html.
Gomes CP, van Hoeve WJ, Sabharwal A (2008) Connections in networks: A hybrid approach. Perron L,
Trick MA, eds., Integration of AI and OR Techniques in Constraint Programming for Combinatorial
Optimization Problems, 303–307 (Berlin, Heidelberg: Springer Berlin Heidelberg).
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning (MIT Press), http://www.deeplearningbook.
org.
Gupta P, Gasse M, Khalil E, Mudigonda P, Lodi A, Bengio Y (2020) Hybrid models for learning to branch.
Advances in neural information processing systems 33.
Gurobi Optimization L (2020) Gurobi optimizer reference manual. URL http://www.gurobi.com.
Hansen P, Mladenovic N, Pérez JAM (2010) Variable neighbourhood search: Methods and applications.
Annals of Operations Research 175(1):367–407.
He H, Daume III H, Eisner JM (2014) Learning to search in branch and bound algorithms. Ghahra-
mani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds., Advances in Neural Information
Processing Systems 27, 3293–3301 (Curran Associates, Inc.), URL http://papers.nips.cc/paper/
5495-learning-to-search-in-branch-and-bound-algorithms.pdf.
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR).
Hendel G (2018) Adaptive large neighborhood search for mixed integer programming. Mathematical Pro-
gramming Computation Under review.
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural computation 9(8):1735–1780.
Hottung A, Tierney K (2019) Neural large neighborhood search for the capacitated vehicle routing problem.
arXiv preprint arXiv:1911.09539 .
Hutter F, Hoos HH, Leyton-Brown K (2011) Sequential model-based optimization for general algorithm con-
figuration. Proceedings of the 5th International Conference on Learning and Intelligent Optimization,
507–523, LION’05 (Springer-Verlag), ISBN 9783642255656.

Page 40
40
Hutter F, Hoos HH, Leyton-Brown K, Stützle T (2009) Paramils: An automatic algorithm configuration
framework. J. Artif. Int. Res. 36(1):267–306, ISSN 1076-9757.
Hutter F, Xu L, Hoos HH, Leyton-Brown K (2014) Algorithm runtime prediction: Methods & evaluation.
Artificial Intelligence 206:79 – 111, ISSN 0004-3702.
IBM ILOG CPLEX (2019) V12.10: User’s manual for CPLEX. URL https://www.ibm.com/analytics/
cplex-optimizer.
Jünger M, Liebling TM, Naddef D, Nemhauser GL, Pulleyblank WR, Reinelt G, Rinaldi G, Wolsey LA (2009)
50 Years of integer programming 1958-2008: From the early years to the state-of-the-art (Springer
Science & Business Media).
Karp RM (1972) Reducibility among combinatorial problems. Complexity of computer computations, 85–103
(Springer).
Khalil E, Dai H, Zhang Y, Dilkina B, Song L (2017a) Learning combinatorial optimization algorithms over
graphs. Advances in Neural Information Processing Systems, 6348–6358.
Khalil E, Le Bodic P, Song L, Nemhauser G, Dilkina B (2016) Learning to branch in mixed integer program-
ming. Schuurmans D, Wellman M, eds., Proceedings of the Thirtieth AAAI Conference on Artificial
Intelligence, 724–731 (United States of America: Association for the Advancement of Artificial Intel-
ligence (AAAI)), URL http://www.aaai.org/Conferences/AAAI/aaai16.php, aAAI Conference on
Artificial Intelligence 2016, AAAI 16 ; Conference date: 12-02-2016 Through 17-02-2016.
Khalil EB, Dilkina B, Nemhauser GL, Ahmed S, Shao Y (2017b) Learning to run heuristics in tree search.
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17,
659–666, URL http://dx.doi.org/10.24963/ijcai.2017/92.
Kingma D, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma DP, Welling M (2014) Auto-encoding variational bayes. Bengio Y, LeCun Y, eds., 2nd Interna-
tional Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014,
Conference Track Proceedings, URL http://arxiv.org/abs/1312.6114.
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint
arXiv:1609.02907 .
Knueven B, Ostrowski J, Watson JP (2018) On mixed integer programming formulations for the unit com-
mitment problem. Optimization Online Repository 2018, URL http://www.optimization-online.
org/DB_FILE/2018/11/6930.pdf.
Kool W, van Hoof H, Welling M (2019) Attention, learn to solve routing problems! International Conference
on Learning Representations, URL https://openreview.net/forum?id=ByxBFsRqYm.
Kotthoff L (2016) Algorithm Selection for Combinatorial Search Problems: A Survey, 149–190 (Cham:
Springer International Publishing).

Page 41
41
Land A, Doig A (1960) An automatic method of solving discrete programming problems. Econometrica
28(3):497–520.
Lawler EL, Wood DE (1966) Branch-and-bound methods: A survey. Operations research 14(4):699–719.
Lei Ba J, Kiros JR, Hinton GE (2016) Layer Normalization. arXiv e-prints arXiv:1607.06450.
Li Z, Chen Q, Koltun V (2018) Combinatorial optimization with graph convolutional networks and guided
tree search. Advances in Neural Information Processing Systems, 539–548.
Lions PL, Mercier B (1979) Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on
Numerical Analysis 16(6):964–979.
Lodi A, Tramontani A (2014) Performance Variability in Mixed-Integer Programming, chapter Chapter 1,
1–12.
Maher S, Miltenberger M, Pedroso JP, Rehfeldt D, Schwarz R, Serrano F (2016) PySCIPOpt: Mathematical
programming in python with the SCIP optimization suite. Mathematical Software – ICMS 2016, 301–
307 (Springer International Publishing), URL http://dx.doi.org/10.1007/978-3-319-42432-3_37.
Mladenovic N, Hansen P (1997) Variable neighborhood search. Computers & operations research
24(11):1097–1100.
Moll R, Barto AG, Perkins TJ, Sutton RS (1999) Learning instance-independent value functions
to enhance local search. Kearns MJ, Solla SA, Cohn DA, eds., Advances in Neural Infor-
mation Processing Systems 11, 1017–1023 (MIT Press), URL http://papers.nips.cc/paper/
1573-learning-instance-independent-value-functions-to-enhance-local-search.pdf.
Nazari M, Oroojlooy A, Snyder L, Takac M (2018) Reinforcement learning for solving the vehicle routing
problem. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds., Advances in
Neural Information Processing Systems, volume 31, 9839–9849 (Curran Associates, Inc.), URL https:
//proceedings.neurips.cc/paper/2018/file/9fb4651c05b2ed70fba5afe0b039a550-Paper.pdf.
O’Donoghue B (2020) Operator splitting for a homogeneous embedding of the monotone linear complemen-
tarity problem. arXiv preprint arXiv:2004.02177 .
O’Donoghue B, Chu E, Parikh N, Boyd S (2016) Conic optimization via operator splitting and homogeneous
self-dual embedding. Journal of Optimization Theory and Applications 169(3):1042–1068, URL http:
//stanford.edu/~boyd/papers/scs.html.
O’Neill R (2017) Computational issues in iso market models .
Pomerleau DA (1989) Alvinn: An autonomous land vehicle in a neural network. Advances in neural infor-
mation processing systems, 305–313.
Ross S, Gordon G, Bagnell D (2011) A reduction of imitation learning and structured prediction to no-
regret online learning. Proceedings of the fourteenth international conference on artificial intelligence
and statistics, 627–635.

Page 42
42
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model.
IEEE Transactions on Neural Networks 20(1):61–80.
Schubert C (2017) Multi-level lookahead branching. Technical report, Institute of Mathematics, TU Berlin.
Selsam D, Lamm M, Bünz B, Liang P, de Moura L, Dill DL (2019) Learning a SAT solver from single-bit
supervision. 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA,
USA, May 6-9, 2019, URL https://openreview.net/forum?id=HJMC_iA5tm.
Shaw P (1998) Using constraint programming and local search methods to solve vehicle routing problems.
International conference on principles and practice of constraint programming, 417–431 (Springer).
Shinano Y, Achterberg T, Berthold T, Heinz S, Koch T (2011) Parascip: a parallel extension of scip. Com-
petence in High Performance Computing 2010, 135–148 (Springer).
Sierksma G, Zwols Y (2015) Linear and integer optimization: Theory and practice (CRC Press).
Song J, Lanka R, Yue Y, Dilkina B (2020) A general large neighborhood search framework for solving integer
programs. arXiv preprint arXiv:2004.00422 .
Taha HA (2014) Integer programming: theory, applications, and computations (Academic Press).
Tang Y, Agrawal S, Faenza Y (2020) Reinforcement learning for integer programming: Learning to
cut. III HD, Singh A, eds., Proceedings of the 37th International Conference on Machine Learn-
ing, volume 119 of Proceedings of Machine Learning Research, 9367–9376 (Virtual: PMLR), URL
http://proceedings.mlr.press/v119/tang20a.html.
Tjeng V, Xiao KY, Tedrake R (2019) Evaluating robustness of neural networks with mixed integer program-
ming. International Conference on Learning Representations, URL https://openreview.net/forum?
id=HyGIdiRqtm.
Vinyals O, Fortunato M, Jaitly N (2015) Pointer networks. Cortes C, Lawrence N, Lee D,
Sugiyama M, Garnett R, eds., Advances in Neural Information Processing Systems, volume 28,
2692–2700 (Curran Associates, Inc.), URL https://proceedings.neurips.cc/paper/2015/file/
29921001f2f04bd3baee84a12e98098f-Paper.pdf.
Wolsey L (1998) Integer Programming. Wiley Series in Discrete Mathematics and Optimization (Wiley),
ISBN 9780471283669, URL https://books.google.co.uk/books?id=x7RvQgAACAAJ.
Xavier AS, Qiu F, Ahmed S (2020) Learning to solve large-scale security-constrained unit commitment
problems. INFORMS Journal on Computing .
Xu K, Li C, Tian Y, Sonobe T, Kawarabayashi Ki, Jegelka S (2018) Representation learning on graphs with
jumping knowledge networks. International Conference on Machine Learning.
Yang Y, Boland N, Dilkina B, Savelsbergh M (2020) Learning generalized strong branching for set covering,
set packing, and 0-1 knapsack problems. Technical report, URL http://www.optimization-online.
org/DB_HTML/2020/02/7626.html.

Page 43
43
Yang Y, Boland N, Savelsbergh M (2019) Multi-variable branching: A case study with 0-1 knapsack problems.
Technical report, URL http://www.optimization-online.org/DB_HTML/2019/10/7431.html.
Zarpellon G, Jo J, Lodi A, Bengio Y (2020) Parameterizing branch-and-bound search trees to learn branching
policies. arXiv preprint arXiv:2002.05120 .
Zhang W, Dietterich TG (1995) A reinforcement learning approach to job-shop scheduling. Proceedings of the
14th International Joint Conference on Artificial Intelligence - Volume 2, 1114–1120, IJCAI’95 (San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc.), ISBN 1558603638.

Page 44
44
12. Appendix
In the appendix we provide supplementary information which can be grouped into four strands:
1. An autoregressive model approach for Neural Diving, which produces promising initial results,
but requires too much further investigation to warrant its inclusion in the main text.
2. Further data shedding light onto the previously described Neural Brancher work from a
slightly different perspective, namely analysis results with respect to the number of nodes in the
branch-and-bound tree, and also PAR-k metrics.
3. Additional details on the ADMM batch LP solver.
4. Extended description of the datasets we used throughout the paper and of the concept of
calibrated time we used to mitigate the problems arising from trying to measure solving times on
large clusters of heterogeneous machines.
12.1. Autoregressive Models for Neural Diving
Fairly simple conditionally-independent generative models may provide enough capacity for certain
problems, especially when optimal solution is simple to predict. However, when the training set
does not contain a dense region of high-quality solutions, such a model might ultimately fail to
adequately express uncertainty and thus either learn to predict an overly narrow sub-region of the
solution space or output spread probability mass over many infeasible solutions.
One well-proven way of overcoming limitations of conditionally-independent models is to model
the desired distribution in auto-regressive way. Such a model would predict variable values sequen-
tially based on some pre-defined ordering. We tried a number of simple extensions of the basic
conditionally-independent model such as LSTM Hochreiter and Schmidhuber (1997) on top of the
pre-computed variable node embedding, but found this kind of auto-regressive dependencies not
expressive enough to provide a clear advantage over the conditionally-independent model.
Another direction we explored was to only provide auto-regressive dependencies through incre-
mentally solving the underlying LP problem as we sample and assign variable values. Thus, we first
compute the variable node embeddings {vd}D
d=1 (see section 3.2 for details). Then defining ˜x(x<d)
to be an LP-solution of M with the first d − 1 variables assigned to particular values, we predict
the value of xd in the following way:
pθ(x|M) =
D
d=1
p(xd|xd−1,...,x1,M) =
D
d=1
Bernoulli(xdd),
(22)
µd = MLP(vd, ˜xd(x<d);θ).
(23)
In practice, we do not re-compute the LP-solution after assigning each of the variables and only
do so after every K (K = 100 in the experiment below) steps by warm-starting the LP solver with
the last state of the previous solution process.

Page 45
45
102
103
104
Calibrated time (seconds)
0.0
0.1
0.2
0.3
Average primal gap
Neural Diving (parallel)
Neural Diving (sequential)
Neural Diving, Incremental LP (parallel)
Neural Diving, Incremental LP (sequential)
102
103
104
Calibrated time (seconds)
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of problems
with average primal gap <= 0.01
Neural Diving (parallel)
Neural Diving (sequential)
Neural Diving, Incremental LP (parallel)
Neural Diving, Incremental LP (sequential)
Figure 15
Comparison between the standard conditionally-independent GNN and its autoregressive version with
incremental LP solver on Google Production Packing dataset.
Figure 15 contains a comparison between the conditionally-independent and the autoregressive
models on Google Production Packing dataset. One can see, that information encoded in incremene-
tal LP solutions does bring an improvement in terms of the average primal gap and the number of
solutions solved to optimality. However, most of the gains brought by higher-quality assignments
are outweighed by increased sampling time since sampling can no longer be done in parallel for each
of the variables. These preliminary results indicate that there is potential in further architectural
advances and leveraging more advanced features, which we leave for future work.
12.1.1. Ordering of variables In many domains, there is often a reasonably natural way
to order variables in an autoregressive model. In the case of audio and other temporal sequences,
time defines an order. In the case of images, raster scan ordering of pixels is commonly used.
In the case of integer programs, we need a way to order the binary variables. This ordering should
be invariant to row and column permutations of the integer program since such changes do not
affect the integer program. Intuitively we want a ”canonical” ordering that places variables that
are ”similar” across instances at similar positions in the ordering so that autoregressive models can
learn effectively. Unfortunately, there is no universal canonical ordering of variables that works well
for all MIPs. Instead, we list several possible ordering methods as an additional hyperparameter,
and pick the one that works best during hyperparameter sweeps. The orderings that we consider
are:
1. Input order: We can proceed through the variables in the order as they are given in the
MIP definition. Often, the ordering in which the variables are written down in the input already
places ”similar” variables close to each other.
2. Coefficient order: We can sort variables by the coefficient by which they contribute to the
objective. In some use cases, the objective coefficient of a variable encodes a semantic meaning of
importance or priority, and we can hope to use this to e.g. process the most important variables
first.

Page 46
46
100
101
102
103
104
105
106
107
Number of nodes
10 4
10 3
10 2
Average Dual Gap
CORLAT
Uniform Random
Target Policy
Neural Brancher
(a) CORLAT Training Set
100
101
102
103
104
105
106
107
Number of nodes
10 3
10 2
Average Dual Gap
CORLAT
Uniform Random
Target Policy
Neural Brancher
(b) CORLAT Test Set
100
101
102
103
104
105
106
107
Number of nodes
10 2
10 1
Average Dual Gap
NN Verification
Uniform Random
Target Policy
Neural Brancher
(c) Neural Network Verification Training Set
100
101
102
103
104
105
106
107
Number of nodes
10 6
10 5
10 4
10 3
10 2
10 1
Average Dual Gap
NN Verification
Uniform Random
Target Policy
Neural Brancher
(d) Neural Network Verification Test Set
Figure 16
Comparison of the average dual gap achieved by the target (expert) policy and the learned policy
on training and test sets for CORLAT and Neural Network Verification. We also include the uniform
random policy as a baseline.
3. Fractionality order: We can order variables by their fractionality in the LP solutions. The
fractionality of a variable is also often used to determine variable order in diving heuristics in the
literature (see, e.g., section 3.1 of Berthold (2006)). Empirically we found this ordering to perform
the best with autoregressive models.
12.2. Imitation Accuracy of the Learned Branching Policy
How well does the learned branching policy approximate the expert policy that it is trained to
imitate? Figure 16 compares the average dual gap achieved by the expert policy and the learned
policy on both the training and test sets for CORLAT and Neural Network Verification. We
also include a policy that takes actions uniformly randomly as a baseline. The results show that
imitation learning succeeds in accurately approximating the expert policy both on training and test
sets. The learned policy is much closer to the expert than it is to the uniform random policy. The
similarity in performance between training and test sets show that the learned policy generalizes
well to unseen instances.

Page 47
47
100
101
102
103
104
105
Number of nodes
0.00
0.25
0.50
0.75
1.00
Fraction of problems solved
with gap <= 0.0
CORLAT
Full Strong
Neural Branching
Neural Branching + Neural Diving (Sequential)
Tuned SCIP
Tuned SCIP + Neural Diving (Sequential)
100
101
102
103
104
105
106
Number of nodes
0.00
0.25
0.50
0.75
1.00
Fraction of problems solved
with gap <= 0.05
NN Verification
100
101
102
103
104
105
Number of nodes
0.0
0.1
0.2
0.3
Fraction of problems solved
with gap <= 0.01
Google Production Packing
101
103
105
107
Number of nodes
0.0
0.2
0.4
0.6
Fraction of problems solved
with gap <= 0.03
Google Production Planning
100
101
102
103
104
105
Number of nodes
0.00
0.25
0.50
0.75
1.00
Fraction of problems solved
with gap <= 0.0001
Electric Grid Optimization
101
103
105
107
Number of nodes
0.0
0.2
0.4
0.6
Fraction of problems solved
with gap <= 0.0
MIPLIB
Figure 17
Survival plots with respect to number of nodes for various combinations of primal heuristics and
variable selection policies as a function of running time on the benchmark datasets.
12.3. Evaluation Results with Respect to Number of Nodes
Sections 7.4 and 8 present evaluation results for Neural Branching and the combined Neural
Branching + Neural Diving solver with respect to calibrated running time. Here we present the
corresponding results with respect to number of nodes. They are shown in Figure 18. While run-
ning time is often the practically relevant resource metric, evaluating with respect to number of
nodes can provide additional insights by removing the effect of different computational costs and
hardware for the solvers being compared.
12.4. PAR-10 Results
In Table 3 we present Penalized Average Running time results with a penalty factor of 10 (PAR-
10). PAR-10 averages over a set of instances the time taken to achieve the target optimality gap,

Page 48
48
100
101
102
103
104
105
Number of nodes
10 4
10 2
100
Average Primal-Dual Gap
CORLAT
Full Strong
Neural Branching
Neural Branching + Neural Diving (Sequential)
Tuned SCIP
Tuned SCIP + Neural Diving (Sequential)
100
101
102
103
104
105
106
Number of nodes
10 6
10 4
10 2
100
Average Primal-Dual Gap
NN Verification
100
101
102
103
104
105
Number of nodes
10 1
100
Average Primal-Dual Gap
Google Production Packing
101
103
105
107
Number of nodes
10 1
100
Average Primal-Dual Gap
Google Production Planning
100
101
102
103
104
105
Number of nodes
10 3
10 2
10 1
100
Average Primal-Dual Gap
Electric Grid Optimization
101
103
105
107
Number of nodes
10 1
100
Average Primal-Dual Gap
MIPLIB
Figure 18
Average primal-dual gap achieved on the test set by various solvers as a function of the number of
nodes.
with timed out instances being assigned a running time equal to the time limit multiplied by the
penalty factor. In our setting we use a time limit of 10000. As can seen in the Table 3 the winner
for four of the five datasets with the largest MIPs is a Neural Solver, supporting our claim that
learning is beneficial.
12.5. ADMM Batch LP Solver
In this section we give more details about the ADMM batch LP solver we use to generate the
expert training data that the neural Branching is trained on. Consider the following composite
convex optimization problem
minimize f(x)+g(z)
subject to x = z

Page 49
49
Solver
CORLAT
NN Verifi-
cation
Google
Production
Packing
Electric
Grid Opti-
mization
Google
Production
Planning
MIPLIB
Full Strong
21.33
120.86
92389.16
39982.2
50899.31
50097.96
Tuned SCIP
6.10
1128.74
77338.42
4398.12
42148.14
32818.75
Neural Branching 76.81
21.72
99793.69
2311.02
58392.5
45637.33
Tuned SCIP +
Neural
Diving
(Sequential)
6.93
1499.49
72032.52
7001.28
48225.51
27762.1
Neural Branching
+ Neural Diving
(Sequential)
7.35
10.20
70026.23
2450.8
48819.25
36383.16
The PAR-10 of
the
Winner
divided by Tuned
SCIP’s
1
0.009
0.9055
0.5255
1
0.8459
Table 3
PAR-10 results with respect to calibrated time. Lower is better. In bold is the best one for each dataset.
over variables x ∈ Rn, z ∈ Rn and where f : Rn → R ∪ {∞}, g : Rn → R ∪ {∞} are closed, proper,
convex functions. The Alternating direction method of multipliers (ADMM) (Boyd et al. 2011,
Lions and Mercier 1979, Eckstein and Bertsekas 1992) applied to this problem is the following
procedure for any ρ > 0:
xk+1 = arg minx
(
f(x)+(ρ/2)x−zk −λk 2
2
)
zk+1 = arg minz
(
g(z)+(ρ/2)xk+1 −z −λk 2
2
)
λk+1 = λk +xk+1 −zk+1.
Under benign conditions this procedure produces a sequence of iterates that converge to optimal
objective values, i.e., f(xk)+g(zk) → f(x)+g(z), and xk −zk
→ 0, where xand zare optimal
primal points (Boyd et al. 2011, §3.2). We can write the LP relaxation of MIP (1) in a form more
amenable to ADMM:
minimize cT x+IAx=y(x, y)+I[bl,bu](˜y)+I[l,u](˜x)
subject to x = ˜x, y = ˜y,
over variables x, ˜x ∈ Rn, y, ˜y∈ Rm, and where IX denotes the convex indicator function of set X,
i.e.,
IX (z) =
0
z ∈ X
otherwise.

Page 50
50
When ADMM is applied to this problem it yields the following algorithm with iterates x, ˜x ∈ Rn,
y, ˜y∈ Rm, λ ∈ Rm+n:
(xk+1,yk+1) = arg minAx=y
cT x+ρ/2
x
y
−
˜xk
˜yk
−
λk
x
λk
y
2
2
˜xk+1
= Π[l,u](xk+1 −λk
x)
˜yk+1
= Π[bl,bu](yk+1 −λk
y)
λk+1
x
λk+1
y
=
λk
x
λk
y
+
xk+1
yk+1
−
˜xk+1
˜yk+1
,
where Π[l,u](z) denotes the Euclidean projection of z onto the interval [l,u] elementwise. It turns
out that only first step is computationally challenging since it involves solving a (potentially large)
set of linear equations. Concretely, for some right-hand side r ∈ Rm+n we want to solve
(xk+1,yk+1) = arg min
Ax=y
cT x+ρ/2
x
y
−
rk
x
rk
y
2
2
.
The solution to the which is given by
xk+1
yk+1
 =
I
AT
A
−I
−1
rk
x −c/ρ
rk
y
+
0
rk
y
.
Note that the matrix to be inverted is the same for all iterations. Not only that, but the matrix
does not depend on which variable we are branching on; it is the same for all LPs we need to solve
at any given node. This means we can batch and solve all the LPs (or however many will fit into
the GPU memory) together on the GPU simultaneously. This yields significant hardware benefits
for both direct and indirect approaches to solving the linear system. Although GPUs are designed
to provide very fast dense matrix-matrix routines the sparse matrix operations make less efficient
uses of the GPU stream parallelism. However, by batching sparse matrix-vector operations into
matrix-matrix, then it is easier to keep the GPU busy and make better use of the parallelism. We
shall see that this yields very large speedups in practice.
12.5.1. GPU speedups Here we present some speedup results for our batching strategy on
a single GPU. We shall consider all possible LPs for all branching candidates at the root nodes of
all the MIPs in the MIPLIB dataset. In all cases we ran the ADMM solver for 100 iterations and
used the LP solution of the root node as a warm-start with which to initialize the ADMM solver.
First, as a concrete example, consider the air05 binary MIP which at the root node (after presolve
and any reductions that happens at the root node) has 6.1k variables, 426 constraints. The matrix
that defines the problem is 98% sparse with 43.4k non-zeros. This high sparsity is not amenable

Page 51
51
0
2000
4000
6000
8000
10000
12000
Number of LPs in batch
2
4
6
8
10
Solve time batch / time for 1 LP
102
103
Speedup factor
0
20
40
60
80
100
120
140
Number of LPs
Figure 19
Left: Solve time vs number of LPs in a batch for the air05 MIP. Right: Histogram of speedup factors
for all MIPs in MIPLIB at root nodes.
to acceleration on GPU, however, when batched the solver can make much better use of the GPU
parallelism. In Figure 19 we show the time required to approximately solve the LPs vs batch-size
on GPU, normalized by time for (approximately) solving a single solver using the same solver. We
show that we can compute all 12.2k branch scores in roughly 10 times the cost of a single solve,
which is approximately a 1200× speedup from batching relative to solving all the LPs sequentially.
As a point of comparison, for the same set of LPs ECOS (Domahidi et al. 2013) takes about 4.5
seconds to solve each one (which is approximately the same as SOPLEX, the built-in LP solver
in SCIP, which takes 3 seconds to solve the LP at the root) so to solve all 12k LPs would require
more than 15 hours. This is far too long for use in an industrial MIP solver and significantly larger
than the ADMM batch LP solver, which finds the (approximate) solutions for all the LPs in 5.4
seconds. Next we show that this pattern holds true for the entire MIPLIB dataset. In Figure 19 we
show the histogram of this speedup across the root nodes of the entire MIPLIB dataset. If there
are n LPs to be solved at the root node, then we define the speedup factor to be
speedup =
n×(time to solve 1 LP)
time to solve n LPs in batch
.
We remove the effect of the fixed costs (e.g., transferring the data to the GPU) by subtracting
the time taken to run zero iterations, which is non-zero, from the denominator and numerator
(if we didn’t remove this time it would artificially inflate the speedup factors we would observe).
This figure shows that the speedup is significant across the board, averaging approximately 800×,
almost three orders of magnitude. Typically larger MIPs yield larger speedups, which also explains
the sudden drop off at around 1000× speedup, since the largest MIPs eventually become too large
for the GPU memory to handle.
12.5.2. Accuracy Warm-starting from the solution at the previous node means not many
iterations are required for reasonable accuracy, which is a typical property of ADMM. This justifies
our use of a relatively small number of ADMM iterations to ‘solve’ the problem. In Figure 20 we

Page 52
52
0
100
200
300
400
500
ADMM iterations
0.70
0.75
0.80
0.85
0.90
0.95
1.00
Objectives correlation
20500
21000
21500
22000
22500
23000
23500
Optimal objective
20500
21000
21500
22000
22500
23000
ADMM objcetive at 100 iters
Figure 20
Correlation of ADMM objective and true objectives at root of air05 MIP.
plot two quantities at the root node of the air05 MIP, at which we need to solve approximately
12k LPs. On the left hand side we plot the correlation of the 12k true objective values with the
objective values returned by ADMM after 100 iterations. The true objective values were found by
solving the same LPs using the high-accuracy ECOS interior point solver (Domahidi et al. 2013)
with parser-solver CVXPY (Diamond and Boyd 2016). As we can see it converges towards one as
the number of iterations increases, but even for a handful of iterations the approximate objective
values are well correlated with the true objective values. Since the branch and bound decisions are
made using the objective values only (see (19)) this high correlation indicates that we have a strong
signal for the branch and bound decision with relatively little computational effort, and typically
only low accuracy LP objective values are required for use in FSB (Achterberg et al. 2005). On
the right we show the scatterplot of the true objective values and the objective values returned by
ADMM after 100 iterations, and we can see the high correlation visually.
12.6. Details of Datasets
The target optimality gap used for each dataset in our evaluation as SCIP’s stopping criterion for
a MIP solve is given in table 4. In the case of Neural Network Verification, the actual criterion used
in the application is to stop when the objective value becomes negative, but this is not expressible
as a constant target gap across all instances. In order to treat all datasets consistently, we have
selected a gap of 0.05.
Figures 21 and 22 show the MIP sizes for the datasets used in our evaluation with and without
presolving using SCIP 7.0.1. Note that, among the application-specific datasets, only Google Pro-
duction Planning contains non-binary integer variables. It is also the most heterogeneous (along
with MIPLIB) in terms of instance sizes. Table 5 summarizes the characteristics of those datasets
before and after presolving with the SCIP 7.0.1 solver. The number of constraints and variables
ranges different orders of magnitude across the datasets. It is worth noting that during presolving
some instances might be deemed infeasible and, hence, dropped from the dataset.

Page 53
53
Table 4
Target optimality gaps for the datasets in our evaluation used as the stopping criterion for SCIP.
Dataset
Target Optimality Gap
CORLAT
0
Neural Network Verification
0.05
Google Production Packing
0.01
Google Production Planning
0.03
Electric Grid Optimization
0.0001
MIPLIB
0
Table 5
Median of the number of constraints and variables (per type) for the different datasets before / after
presolving using SCIP 7.0.1.
Dataset
Constraints
Variables
Binary
Integer (non-binary) Continuous
CORLAT
470/454
466/458
100/97
0/0
366/361
NN Verification
6531/1407
7142/1629
171/170
0/0
6972/1455
Google Production Packing 36905/24495
10046/8919
3773/3437
0/0
6231/5421
Google Production Planning
11910/478
9884/404
833/119
462/119
8337/136
Electric Grid Optimization 61851/45834 60720/58389 42240/42147
0/0
18768/16623
MIPLIB
7706/4388
11090/9629
4450/3816
0/0
218/96
We additionally report some key network statistics for the bipartite graphs that represent the
resulting MIPs of the presolved instances. These include: a) the average degree across all nodes, b)
the density of the bipartite graph, i.e. the number of present edges over the number of all possible
edges between the nodes, c) the maximum degree centrality for the variables nodes, and d) the
diameter of the graph, i.e. the longest among all shortest paths between any pair of nodes. The
normalized histograms for the graph characteristics are presented in Figure 23.
As mentioned in the main paper, the above datasets were randomly split into training, val-
idation and test sets with sizes 70%, 15% and 15%, respectively. This split is handled dif-
ferently for MIPLIB as it constitutes an externally defined ‘Benchmark Set’1. This set con-
tains 240 MIPs on which benchmarking results are reported for commercial and open source
solvers and, therefore, constitutes our test set. The remaining instances in MIPLIB2017 and
MIPLIB2010 correspond to our training and validation sets, respectively. Specifically, we use as
the training set those instances from the MIPLIB2017 ‘Collection Set’2 which do not belong
to the ‘Benchmark Set’. Similarly, we use as validation set the instances from MIPLIB2010
which do not belong to MIPLIB2017. This process yields 825 training, 155 validation and
240 test non-overlapping MIPs. The CORLAT dataset is available at https://bitbucket.org/

Page 54
54
101
102
103
104
105
106
107
Number of variables
101
102
103
104
105
106
107
Number of constraints
Google Production Planning
CORLAT
Neural Network Verification
Google Production Packing
Electric Grid Optimization
101
102
103
104
105
106
107
Number of variables
101
102
103
104
105
106
107
Number of constraints
MIPLIB
Figure 21
Original number of variables and constraints (before presolving) for the datasets used in our
evaluation. Application-specific datasets are shown on the left, and MIPLIB is shown on the
right.
101
102
103
104
105
Number of binary variables
101
102
103
104
105
Number of constraints
Google Production Planning
CORLAT
Neural Net Verification
Google Production Packing
Electric Grid Optimization
102
103
104
Number of non-binary integer variables
101
102
103
104
105
Number of constraints
Google Production Planning
CORLAT
Neural Net Verification
Google Production Packing
Electric Grid Optimization
Figure 22
Number of binary variables (left) and non-binary integer variables (right) versus number of
constraints after presolving using SCIP 7.0.1 for the application-specific datasets used in our
evaluation.
mlindauer/aclib2/src/master/. The MIPLIB dataset (with the aforementioned tags) is available
at https://miplib.zib.de/download.html. The Neural Network Verification dataset is available
at https://github.com/deepmind/deepmind-research/tree/master/neural_mip_solving.

Page 55
55
5.0
7.5
0.0
0.5
degree_mean
0.00
0.25
0.0
0.5
density
0.25
0.50
0.75
0.0
0.5
degree_centrality_max
5.0
7.5
0.0
0.5
diameter
30
40
50
0.0
0.1
0.02
0.04
0.0
0.2
0.2
0.4
0.0
0.1
12.5
15.0
0.0
0.5
20
30
0.0
0.1
0.002
0.003
0.0
0.2
0.025
0.030
0.0
0.1
8
10
0.0
0.5
2.25
2.50
0.0
0.5
0.00
0.01
0.0
0.5
0.00
0.05
0.0
0.2
50
75
0.0
0.2
5.25
5.50
5.75
0.0
0.1
0.00012
0.00014
0.0
0.1
0.0045 0.0050 0.0055
0.0
0.1
18
20
22
0
1
0
200
0.0
0.5
0
1
0.0
0.5
0
1
0.0
0.5
0
100
0.0
0.5
value
fraction
CORLAT
Neural Network Verification
Google Production Packing
Google Production Planning
Electric Grid Optimization
MIPLIB
Figure 23
Network statistics for the presolved datasets.
12.7. Details of Calibrated Time
As explained in section 5, we use calibrated time to accurately measure running time of evaluation
solve tasks on a shared, heterogeneous compute cluster. For each solve task, we periodically solve
a small calibration MIP on a different thread as the solve task on the same machine. We define
the speed of the machine to be
Speed =
1
Wall clock time to solve calibration MIP
.
(24)

Page 56
56
For each periodic measurement of calibrated time, we estimate the speed K times and use the
average. K is set to be the number of samples needed to estimate mean speed with 95% confidence,
with a minimum of 3 samples and a maximum of 30. The elapsed calibrated time ∆tcalibrated since
the last measurement is
∆tcalibrated = Speed×∆twallclock,
(25)
where ∆twallclock is the elapsed wallclock time since the last measurement. We use the MIP named
vpm2 from MIPLIB2003 Achterberg et al. (2006) as the calibration MIP.
Note that the above definition of calibrated time does not have a time unit. Instead it is (in effect)
a count of the calibrated MIP solves during the evaluation solve task. To give it a unit of seconds,
one can choose a reference machine with respect to which we want to report evaluation solve times,
accurately measure the calibration MIP’s solve time on it (without other tasks interfering), and
multiply the calibrated time in equation 25 by the reference machine’s estimated calibration MIP
solve time. The resulting quantity has a unit of seconds. It can be interpreted as the time the
evaluation solve task would have taken if it ran on the reference machine. We select Intel Xeon
3.50GHz CPU with 32GB RAM as the reference machine. We estimate the mean solving time of
the calibration MIP vpm2 for SCIP 7.0.1 on a single core to be 1.989 seconds (estimated from 1000
samples, collected after a warm up solve which is discarded). All the calibrated time results in the
paper are expressed with respect to the reference machine in seconds.
To illustrate the benefit of calibrated time, table 6 presents results for four instances from
MIPLIB 2017 (Gleixner et al. 2019) comparing 1) wallclock time measurement on a shared cluster
machine, 2) calibrated time measurement on a shared cluster machine, and 3) wallclock time
measurement on the reference machine. We solve a MIP to optimality using SCIP 7.0.1 1000 times
for each of these three settings. We then compute the coefficient of variation as the sample standard
deviation divided by the sample mean for the 1000 measurements. Calibrated time reduces the
coefficient of variation by about 30×, 8.3×, 7×, and 1.5×, on air05, n5-3, swath1, and dano3 3,
respectively, compared to wallclock time measurements, and brings it closer to that of the reference
machine.

Page 57
57
Table 6
Coefficient of variation for different time measurements on MIPs from MIPLIB 2017.
MIP
Shared cluster Shared cluster Reference machine
wallclock time calibrated time
wallclock time
coeff. of var.
coeff. of var.
coeff. of var.
air05
0.979
0.032
0.006
dano3 3
0.224
0.144
0.006
n5-3
0.318
0.038
0.007
swath1
0.623
0.087
0.010