Towards Lower Bounds on the Depth of ReLU Neural Networks

Authors’ accepted manuscript; to appear in the SIAM Journal on Discrete Mathematics. A preliminary conference version appeared in the proceedings of the NeurIPS 2021 conference. We thank the anonymous referees of both the journal and the conference version for their insightful comments which helped to improve the presentation and clarity. Christoph Hertrich gratefully acknowledges funding by DFG-GRK 2434 “Facets of Complexity”. Amitabh Basu gratefully acknowledges support from AFOSR Grant FA95502010341 and NSF Grant CCF2006587. Martin Skutella gratefully acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – The Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID: 390685689).

Christoph Hertrich, London School of Economics and Political Science, London, UK
Amitabh Basu, Johns Hopkins University, Baltimore, USA
Marco Di Summa, Università degli Studi di Padova, Padua, Italy
Martin Skutella, Technische Universität Berlin, Berlin, Germany
Abstract

We contribute to a better understanding of the class of functions that can be represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning any function. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). As a by-product of our investigations, we settle an old conjecture about piecewise linear functions by Wang and Sun [74] in the affirmative. We also present upper bounds on the sizes of neural networks required to represent functions with logarithmic depth.

1 Introduction

A core problem in machine learning and statistics is the estimation of an unknown data distribution with access to independent and identically distributed samples from the distribution. It is well-known that there is a tension between the expressivity of the model chosen to approximate the distribution and the number of samples needed to solve the problem with high confidence (or equivalently, the variance one has in one’s estimate). This is referred to as the bias-variance trade-off or the bias-complexity trade-off. Neural networks provide a way to turn this bias-complexity knob in a controlled manner that has been studied for decades going back to the idea of a perceptron by Rosenblatt [62]. This is done by modifying the architecture of a neural network class of functions, in particular its size in terms of depth and width. As one increases these parameters, the class of functions becomes more expressive. In terms of the bias-variance trade-off, the “bias” decreases as the class of functions becomes more expressive, but the “variance” or “complexity” increases.

So-called universal approximation theorems [5, 18, 40] show that even with a single hidden layer, that is, when the depth of the architecture achieves its smallest possible value, one can essentially reduce the “bias” as much as one desires, by increasing the width. Nevertheless, it can be advantageous both theoretically and empirically to increase the depth because a substantial reduction in the size can be achieved by this [6, 21, 46, 63, 69, 70, 75]. To get a better quantitative handle on these trade-offs, it is important to understand what classes of functions are exactly representable by neural networks with a certain architecture. The precise mathematical statements of universal approximation theorems show that single layer networks can arbitrarily well approximate any continuous function (under some additional mild hypotheses). While this suggests that single layer networks are good enough from a learning perspective, from a mathematical perspective, one can ask the question if the class of functions represented by single layer networks is a strict subset of the class of functions represented by networks with two or more hidden layers. On the question of size, one can ask for precise bounds on the required width of a network with given depth to represent a certain class of functions. A better understanding of the function classes exactly represented by different architectures has implications not just for mathematical foundations, but also algorithmic and statistical learning aspects of neural networks, as recent advances on the training complexity show [6, 11, 28, 23, 42]. The task of searching for the “best” function in a class can only benefit from a better understanding of the nature of functions in that class. A motivating question behind the results in this paper is to understand the hierarchy of function classes exactly represented by neural networks of increasing depth.

We now introduce more precise notation and terminology to set the stage for our investigations.

1.1 Notation and Definitions

We write $[n] \coloneqq \{1, 2, \dots, n\}$ for the set of natural numbers up to $n$ (without zero) and $[n]_0 \coloneqq [n] \cup \{0\}$ for the same set including zero. For any $n \in \mathbb{N}$, let $\sigma\colon \mathbb{R}^n \to \mathbb{R}^n$ be the component-wise rectifier function

\[ \sigma(x) = (\max\{0, x_1\}, \max\{0, x_2\}, \dots, \max\{0, x_n\}). \]

For any number of hidden layers $k \in \mathbb{N}$, a $(k+1)$-layer feedforward neural network with rectified linear units (ReLU NN or simply NN) is given by $k$ affine transformations $T^{(\ell)}\colon \mathbb{R}^{n_{\ell-1}} \to \mathbb{R}^{n_\ell}$, $x \mapsto A^{(\ell)} x + b^{(\ell)}$, for $\ell \in [k]$, and a linear transformation $T^{(k+1)}\colon \mathbb{R}^{n_k} \to \mathbb{R}^{n_{k+1}}$, $x \mapsto A^{(k+1)} x$. It is said to compute or represent the function $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_{k+1}}$ given by

\[ f = T^{(k+1)} \circ \sigma \circ T^{(k)} \circ \sigma \circ \dots \circ T^{(2)} \circ \sigma \circ T^{(1)}. \]

The matrices $A^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ are called the weights and the vectors $b^{(\ell)} \in \mathbb{R}^{n_\ell}$ are the biases of the $\ell$-th layer. The number $n_\ell \in \mathbb{N}$ is called the width of the $\ell$-th layer. The maximum width of all hidden layers $\max_{\ell \in [k]} n_\ell$ is called the width of the NN. Further, we say that the NN has depth $k+1$ and size $\sum_{\ell=1}^{k} n_\ell$.

Often, NNs are represented as layered, directed, acyclic graphs where each dimension of each layer (including input layer $\ell = 0$ and output layer $\ell = k+1$) is one vertex, weights are arc labels, and biases are node labels. Then, the vertices are called neurons.

For a given input $x = x^{(0)} \in \mathbb{R}^{n_0}$, let $y^{(\ell)} \coloneqq T^{(\ell)}(x^{(\ell-1)}) \in \mathbb{R}^{n_\ell}$ be the activation vector and $x^{(\ell)} \coloneqq \sigma(y^{(\ell)}) \in \mathbb{R}^{n_\ell}$ the output vector of the $\ell$-th layer. Further, let $y \coloneqq y^{(k+1)} = f(x)$ be the output of the NN. We also say that the $i$-th component of each of these vectors is the activation or the output of the $i$-th neuron in the $\ell$-th layer.

To illustrate the definition of NNs and how they compute functions, Figure 1 shows an NN with one hidden layer computing the maximum of two numbers.

Figure 1: An NN with two input neurons, labeled $x_1$ and $x_2$, three hidden neurons, labeled with the shape of the rectifier function, and one output neuron, labeled $y$. The arcs are labeled with their weights and all biases are zero. The NN has depth 2, width 3, and size 3. It computes the function $x \mapsto y = \max\{0, x_1 - x_2\} + \max\{0, x_2\} - \max\{0, -x_2\} = \max\{0, x_1 - x_2\} + x_2 = \max\{x_1, x_2\}$.
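The definition of an NN and the example of Figure 1 can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`relu`, `affine`, `forward`, `depth_width_size`, `max2`) are hypothetical, and the weight matrices are an assumption read off from the formula in the caption of Figure 1, not from the drawing itself.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Component-wise rectifier: sigma(x)_i = max{0, x_i}."""
    return [max(0.0, t) for t in v]


def affine(A: Matrix, b: Vector, x: Vector) -> Vector:
    """Compute A x + b using plain Python lists."""
    return [sum(a * t for a, t in zip(row, x)) + beta for row, beta in zip(A, b)]


def forward(hidden: List[Tuple[Matrix, Vector]], A_out: Matrix, x: Vector) -> Vector:
    """Evaluate f = T^(k+1) o sigma o T^(k) o ... o sigma o T^(1) at x.

    `hidden` holds the k pairs (A^(l), b^(l)); `A_out` is the final linear map.
    """
    for A, b in hidden:
        y = affine(A, b, x)  # activation vector y^(l)
        x = relu(y)          # output vector x^(l)
    return affine(A_out, [0.0] * len(A_out), x)  # output layer has no bias


def depth_width_size(hidden: List[Tuple[Matrix, Vector]]) -> Tuple[int, int, int]:
    """Return (depth, width, size) as defined in Section 1.1."""
    widths = [len(b) for _, b in hidden]
    return len(hidden) + 1, max(widths), sum(widths)


# Assumed weights for Figure 1: hidden neurons compute
# max{0, x1 - x2}, max{0, x2}, max{0, -x2}; the output adds them with signs +, +, -.
FIG1_HIDDEN = [([[1.0, -1.0], [0.0, 1.0], [0.0, -1.0]], [0.0, 0.0, 0.0])]
FIG1_OUT = [[1.0, 1.0, -1.0]]


def max2(x1: float, x2: float) -> float:
    """Maximum of two numbers, computed by the one-hidden-layer NN above."""
    return forward(FIG1_HIDDEN, FIG1_OUT, [x1, x2])[0]
```

With these assumed weights, `depth_width_size(FIG1_HIDDEN)` gives depth 2, width 3, and size 3, matching the caption.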

For $k \in \mathbb{N}$, we define

\begin{align*}
\operatorname{ReLU}_n(k) &\coloneqq \{f\colon \mathbb{R}^n \to \mathbb{R} \mid \text{$f$ can be represented by a $(k+1)$-layer NN}\}, \\
\operatorname{CPWL}_n &\coloneqq \{f\colon \mathbb{R}^n \to \mathbb{R} \mid \text{$f$ is continuous and piecewise linear}\}.
\end{align*}

By definition, a continuous function $f\colon \mathbb{R}^n \to \mathbb{R}$ is piecewise linear in case there is a finite set of polyhedra whose union is $\mathbb{R}^n$, and $f$ is affine linear over each such polyhedron.

In order to analyze $\operatorname{ReLU}_n(k)$, we use another function class defined as follows. We call a function $g$ a $p$-term max function if it can be expressed as the maximum of $p$ affine terms, that is, $g(x) = \max\{\ell_1(x), \ldots, \ell_p(x)\}$ where $\ell_i\colon \mathbb{R}^n \to \mathbb{R}$ is affine linear for $i \in [p]$. Note that this also includes max functions with fewer than $p$ terms, as some functions $\ell_i$ may coincide. Based on that, we define

\[ \operatorname{MAX}_n(p) \coloneqq \{f\colon \mathbb{R}^n \to \mathbb{R} \mid \text{$f$ is a linear combination of $p$-term max functions}\}. \]

Note that Wang and Sun [74] call $p$-term max functions $(p-1)$-order hinges and linear combinations of those $(p-1)$-order hinging hyperplanes.
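As a small illustration of the class $\operatorname{MAX}_n(p)$, the following sketch builds a linear combination of 2-term max functions in dimension $n = 1$. The example function (the "hat" function $\max\{0, 1 - |x|\}$) and all helper names are our own choices for illustration and do not appear in the paper.

```python
from typing import Callable, List, Sequence, Tuple

# An affine term x -> a.x + c is stored as (coefficients a, constant c).
AffineTerm = Tuple[Sequence[float], float]
Function = Callable[[Sequence[float]], float]


def affine_value(term: AffineTerm, x: Sequence[float]) -> float:
    """Evaluate a single affine term at the point x."""
    coeffs, const = term
    return sum(a * t for a, t in zip(coeffs, x)) + const


def max_function(terms: List[AffineTerm]) -> Function:
    """Return the p-term max function x -> max{l_1(x), ..., l_p(x)}."""
    return lambda x: max(affine_value(term, x) for term in terms)


def linear_combination(weights: List[float], max_fns: List[Function]) -> Function:
    """Return x -> sum_j weights[j] * max_fns[j](x), an element of MAX_n(p)."""
    return lambda x: sum(w * g(x) for w, g in zip(weights, max_fns))


ZERO: AffineTerm = ((0.0,), 0.0)

# hat(x) = max{0, x + 1} - 2 max{0, x} + max{0, x - 1} = max{0, 1 - |x|},
# so the hat function lies in MAX_1(2).
hat = linear_combination(
    [1.0, -2.0, 1.0],
    [
        max_function([ZERO, ((1.0,), 1.0)]),
        max_function([ZERO, ((1.0,), 0.0)]),
        max_function([ZERO, ((1.0,), -1.0)]),
    ],
)
```

For instance, `hat([0.0])` evaluates to `1.0` and `hat([2.0])` to `0.0`. Note that the hat function is not convex, which shows why linear combinations (and not only single max functions) are needed.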

If the input dimension $n$ is not important for the context, we sometimes drop the index and use $\operatorname{ReLU}(k) \coloneqq \bigcup_{n \in \mathbb{N}} \operatorname{ReLU}_n(k)$ and $\operatorname{MAX}(p) \coloneqq \bigcup_{n \in \mathbb{N}} \operatorname{MAX}_n(p)$ instead.

We will use the standard notations $\operatorname{conv} A$ and $\operatorname{cone} A$ for the convex and conic hulls of a set $A \subseteq \mathbb{R}^n$. For an in-depth treatment of polyhedra and (mixed-integer) optimization, we refer to the book by Schrijver [64].

1.2 Representing Piecewise Linear Functions with ReLU Networks

It is not hard to see that every function expressed by a ReLU network is continuous and piecewise linear (CPWL) because it is composed of affine transformations and ReLU functions, which are both CPWL. Based on a result by Wang and Sun [74], Arora et al. [6] prove that the converse is true as well by showing that any CPWL function can be represented with logarithmic depth.

Theorem 1.1 (Arora et al. [6]).

If $n \in \mathbb{N}$ and $k^* \coloneqq \lceil \log_2(n+1) \rceil$, then $\operatorname{CPWL}_n = \operatorname{ReLU}_n(k^*)$.

Since this result is the starting point for our paper, let us briefly sketch its proof. For this purpose, we start with a simple special case of a CPWL function: the maximum of $n$ numbers. Recall that one hidden layer suffices to compute the maximum of two numbers, see Figure 1. Now one can easily stack this operation: in order to compute the maximum of four numbers, we divide them into two pairs with two numbers each, compute the maximum of each pair and then the maximum of the two results. This idea results in the NN depicted in Figure 2, which has two hidden layers.

Figure 2: An NN to compute the maximum of four numbers that consists of three copies of the NN in Figure 1. Note that no activation function is applied at the two unlabeled middle vertices (representing $\max\{x_1, x_2\}$ and $\max\{x_3, x_4\}$). Therefore, the linear transformations directly before and after these vertices can be combined into a single one. Thus, the network has total depth three (two hidden layers).

Repeating this procedure, one can compute the maximum of eight numbers with three hidden layers, and, in general, the maximum of $2^k$ numbers with $k$ hidden layers. Phrasing this the other way around, we obtain that the maximum of $n$ numbers can be computed with $\lceil \log_2(n) \rceil$ hidden layers. Since NNs can easily form affine combinations, this implies the following lemma.

Lemma 1.2 (Arora et al. [6]).

If $n, k \in \mathbb{N}$, then $\operatorname{MAX}_n(2^k) \subseteq \operatorname{ReLU}_n(k)$.
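The stacking argument can be simulated directly. The sketch below is a hypothetical illustration, not the authors' construction verbatim: it applies the two-number gadget of Figure 1 round by round and counts one hidden layer per round. Passing an unpaired value through a layer via $a = \max\{0, a\} - \max\{0, -a\}$ is our own assumption for handling odd counts.

```python
from typing import List, Sequence, Tuple


def relu(t: float) -> float:
    """Scalar rectifier max{0, t}."""
    return max(0.0, t)


def max2(a: float, b: float) -> float:
    """max{a, b} with one hidden layer of three ReLU neurons (Figure 1)."""
    return relu(a - b) + relu(b) - relu(-b)


def passthrough(a: float) -> float:
    """Carry a value through one hidden layer unchanged: a = relu(a) - relu(-a)."""
    return relu(a) - relu(-a)


def max_by_stacking(values: Sequence[float]) -> Tuple[float, int]:
    """Return (maximum, number of hidden layers used) for a non-empty input."""
    layer: List[float] = [float(v) for v in values]
    hidden_layers = 0
    while len(layer) > 1:
        # Pair up neighbours and take their maxima within one hidden layer.
        nxt = [max2(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            nxt.append(passthrough(layer[-1]))  # unpaired value is carried along
        layer = nxt
        hidden_layers += 1
    return layer[0], hidden_layers
```

For five inputs such as `[3, -1, 7, 2, 5]`, the simulation uses three rounds, in line with $\lceil \log_2(5) \rceil = 3$. The arithmetic is exact for integer-valued inputs; for general floating-point inputs, rounding in `a - b` may introduce small errors.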

The question whether the depth of this construction is best possible is one of the central open questions we attack in this paper.

In fact, the maximum function is not just a nice toy example; it is, in some sense, the most difficult of all CPWL functions to represent for a ReLU NN. This is due to a result by Wang and Sun [74] stating that every CPWL function defined on $\mathbb{R}^n$ can be written as a linear combination of $(n+1)$-term max functions.

Theorem 1.3 (Wang and Sun [74]).

If $n \in \mathbb{N}$, then $\operatorname{CPWL}_n = \operatorname{MAX}_n(n+1)$.

The proof given by Wang and Sun [74] is technically involved and we do not go into details here. However, in Section 4 we provide an alternative proof yielding a slightly stronger result. This will be useful to bound the width of NNs representing arbitrary CPWL functions.

Theorem 1.1 by Arora et al. [6] can now be deduced from combining Lemma 1.2 and Theorem 1.3: In fact, for $k^* = \lceil \log_2(n+1) \rceil$, one obtains

\[ \operatorname{CPWL}_n = \operatorname{MAX}_n(n+1) \subseteq \operatorname{ReLU}_n(k^*) \subseteq \operatorname{CPWL}_n \]

and thus equality in the whole chain of subset relations.

1.3 Our Main Conjecture

We wish to understand whether the logarithmic depth bound in Theorem 1.1 by Arora et al. [6] is best possible or whether one can do better. We believe it is indeed best possible and pose the following conjecture to better understand the importance of depth in neural networks.

Conjecture 1.4.

For every $n \in \mathbb{N}$, let $k^* \coloneqq \lceil \log_2(n+1) \rceil$. Then it holds that

\[ \operatorname{ReLU}_n(0) \subsetneq \operatorname{ReLU}_n(1) \subsetneq \dots \subsetneq \operatorname{ReLU}_n(k^*-1) \subsetneq \operatorname{ReLU}_n(k^*) = \operatorname{CPWL}_n. \tag{1} \]

Conjecture 1.4 claims that any additional layer up to $k^*$ hidden layers strictly increases the set of representable functions. This would imply that the construction by Arora et al. [6] is actually depth-minimal.

Observe that, in order to prove Conjecture 1.4, it is sufficient to find, for every $k^* \in \mathbb{N}$, one function $f \in \operatorname{ReLU}_n(k^*) \setminus \operatorname{ReLU}_n(k^*-1)$ with $n = 2^{k^*-1}$. This also implies all other strict inclusions $\operatorname{ReLU}_n(i-1) \subsetneq \operatorname{ReLU}_n(i)$ for $i < k^*$ because $\operatorname{ReLU}_n(i-1) = \operatorname{ReLU}_n(i)$ immediately implies that $\operatorname{ReLU}_n(i-1) = \operatorname{ReLU}_n(i')$ for all $i' \geq i-1$.
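To see why the dimension $n = 2^{k^*-1}$ appears here, it can help to tabulate $k^* = \lceil \log_2(n+1) \rceil$ for small $n$. The following sketch uses exact integer arithmetic; the function names are hypothetical and chosen only for this illustration.

```python
def k_star(n: int) -> int:
    """Return ceil(log2(n + 1)) exactly for an integer n >= 0.

    For n >= 0, the bit length of n equals ceil(log2(n + 1)).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return n.bit_length()


def smallest_dimension(k: int) -> int:
    """Smallest input dimension n whose depth bound k_star(n) equals k >= 1."""
    return 2 ** (k - 1)


if __name__ == "__main__":
    for n in range(1, 9):
        print(f"n = {n}: k* = {k_star(n)}")
```

The value of $k^*$ increases exactly at $n = 1, 2, 4, 8, \dots$, so $2^{k^*-1}$ is the smallest dimension in which $k^*$ hidden layers are claimed to be necessary.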

In fact, thanks to Theorem 1.3 by Wang and Sun [74], there is a canonical candidate for such a function, allowing us to reformulate the conjecture as follows.

Conjecture 1.5.

For $k \in \mathbb{N}$, $n = 2^k$, the function $f_n(x) = \max\{0, x_1, \dots, x_n\}$ cannot be represented with $k$ hidden layers, that is, $f_n \notin \operatorname{ReLU}_n(k)$.

Proposition 1.6.

Conjectures 1.4 and 1.5 are equivalent.

Proof.

We argued above that Conjecture 1.5 implies Conjecture 1.4. For the other direction, we prove the contrapositive, that is, assuming that Conjecture 1.5 is violated, we show that Conjecture 1.4 is violated as well. To this end, suppose there is a $k \in \mathbb{N}$, $n = 2^k$, such that $f_n$ is representable with $k$ hidden layers. We argue that under this hypothesis, any $(n+1)$-term max function can be represented with $k$ hidden layers. To see this, observe that

\[ \max\{\ell_1(x), \ldots, \ell_{n+1}(x)\} = \max\{0, \ell_1(x) - \ell_{n+1}(x), \ldots, \ell_n(x) - \ell_{n+1}(x)\} + \ell_{n+1}(x). \]

Modifying the first-layer weights of the NN computing $f_n$ such that input $x_i$ is replaced by the affine expression $\ell_i(x) - \ell_{n+1}(x)$, one obtains a $k$-hidden-layer NN computing the function $\max\{0, \ell_1(x) - \ell_{n+1}(x), \ldots, \ell_n(x) - \ell_{n+1}(x)\}$. Moreover, since affine functions, in particular also $\ell_{n+1}(x)$, can easily be represented by $k$-hidden-layer NNs, we obtain that any $(n+1)$-term maximum is in $\operatorname{ReLU}_n(k)$. Using Theorem 1.3 by Wang and Sun [74], it follows that $\operatorname{ReLU}_n(k) = \operatorname{CPWL}_n$. In particular, since $k^* \coloneqq \lceil \log_2(n+1) \rceil = k+1$, we obtain that Conjecture 1.4 must be violated as well. ∎
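The identity used in the proof, which shifts an $(n+1)$-term maximum so that one term becomes zero, can be checked numerically. The sketch below is an illustration only: the representation of affine terms as (coefficients, constant) pairs and all function names are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# An affine term l(x) = a.x + c is stored as (coefficients a, constant c).
AffineTerm = Tuple[Sequence[int], int]


def evaluate(term: AffineTerm, x: Sequence[int]) -> int:
    """Evaluate the affine term at the point x."""
    coeffs, const = term
    return sum(a * t for a, t in zip(coeffs, x)) + const


def max_of_terms(terms: List[AffineTerm], x: Sequence[int]) -> int:
    """Left-hand side: max{l_1(x), ..., l_{n+1}(x)}."""
    return max(evaluate(term, x) for term in terms)


def shifted_form(terms: List[AffineTerm], x: Sequence[int]) -> int:
    """Right-hand side: max{0, l_i(x) - l_{n+1}(x) for i <= n} + l_{n+1}(x)."""
    last = evaluate(terms[-1], x)
    shifted = [evaluate(term, x) - last for term in terms[:-1]]
    return max([0] + shifted) + last


# Three affine terms on R^2, i.e. n = 2 and an (n+1)-term max function.
EXAMPLE_TERMS: List[AffineTerm] = [((1, 0), 0), ((0, 1), 2), ((-1, -1), 1)]
```

Integer data is used so that both sides agree exactly; with floating-point inputs, the two expressions may differ by rounding error.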

It is known that 1.5 holds for k=1𝑘1k=1italic_k = 1 [56], that is, the CPWL function max{0,x1,x2}0subscript𝑥1subscript𝑥2\max\{0,x_{1},x_{2}\}roman_max { 0 , italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT } cannot be computed by a 2-layer NN. The reason for this is that the set of breakpoints of a CPWL function computed by a 2-layer NN is always a union of lines, while the set of breakpoints of max{0,x1,x2}0subscript𝑥1subscript𝑥2\max\{0,x_{1},x_{2}\}roman_max { 0 , italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT } is a union of three half-lines; compare Figure 3 and the detailed proof by Mukherjee and Basu [56]. Moreover, in subsequent work to the first version of this article, it was shown that the conjecture is true for all k𝑘k\in{\mathbb{N}}italic_k ∈ blackboard_N if one only allows integer weights in the neural network [31]. However, this proof does not easily generalize to arbitrary, real-valued weights. Thus, the conjecture remains open for all k2𝑘2k\geq 2italic_k ≥ 2.

Figure 3: Set of breakpoints of the function $\max\{0,x_1,x_2\}$ (left). This function cannot be computed by a 2-layer NN (middle), since the set of breakpoints of any function computed by such an NN is always a union of lines (right).

1.4 Contribution and Outline

In this paper, we present the following results as partial progress towards resolving this conjecture.

In Section 2, we resolve Conjecture 1.5 for $k=2$, under a natural assumption on the breakpoints of the function represented by any intermediate neuron. Intuitively, the assumption states that no neuron introduces unexpected breakpoints compared to the final function we want to represent. We call such neural networks $H$-conforming; see Section 2 for a formal definition. We then provide a computer-based proof leveraging techniques from mixed-integer programming for the following theorem.

Theorem 1.7.

There does not exist an $H$-conforming 3-layer ReLU NN computing the function $\max\{0,x_1,x_2,x_3,x_4\}$.

In the light of Lemma 1.2, stating that $\operatorname{MAX}(2^k) \subseteq \operatorname{ReLU}(k)$ for all $k \in \mathbb{N}$, one might ask whether the converse is true as well, that is, whether the classes $\operatorname{MAX}(2^k)$ and $\operatorname{ReLU}(k)$ are actually equal. This would not only provide a neat characterization of $\operatorname{ReLU}(k)$, but also prove Conjecture 1.5 without any additional assumption since one can show that $\max\{0, x_1, \dots, x_{2^k}\}$ is not contained in $\operatorname{MAX}(2^k)$.
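To make the inclusion from Lemma 1.2 concrete, the following minimal sketch computes the maximum of $2^k$ numbers with $k$ hidden ReLU layers. It relies on the identity $\max\{a,b\} = \sigma(a-b) + \sigma(b) - \sigma(-b)$, which is one standard construction and not necessarily the one used in the proof of Lemma 1.2. The helper names `max_layer` and `max_via_relu` are hypothetical.

```python
def relu(t):
    return max(0, t)


def max_layer(values):
    """One hidden layer: halve the list by taking pairwise maxima.

    Uses max(a, b) = relu(a - b) + relu(b) - relu(-b), so each pair costs
    three ReLU neurons followed by a linear combination.
    """
    assert len(values) % 2 == 0
    out = []
    for a, b in zip(values[0::2], values[1::2]):
        hidden = (relu(a - b), relu(b), relu(-b))      # three hidden neurons
        out.append(hidden[0] + hidden[1] - hidden[2])  # linear output map
    return out


def max_via_relu(values):
    """Return (maximum, number of hidden layers) for a list of 2^k values."""
    layers = 0
    while len(values) > 1:
        values = max_layer(values)
        layers += 1
    return values[0], layers


print(max_via_relu([3, -1, 4, 1, -5, 9, 2, 6]))
```

The linear output map of one round can be merged with the affine map feeding the next round, so $k$ rounds correspond to $k$ hidden layers. Feeding affine expressions instead of raw inputs gives maxima of $2^k$ affine terms.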

In fact, for $k=1$, it is true that $\operatorname{ReLU}(1) = \operatorname{MAX}(2)$, that is, a function is computable with one hidden layer if and only if it is a linear combination of 2-term max functions. However, in Section 3, we show the following theorem.

Theorem 1.8.

For every $k \geq 2$, the set $\operatorname{ReLU}(k)$ is a strict superset of $\operatorname{MAX}(2^k)$.

To achieve this result, the key technical ingredient is the theory of polyhedral complexes associated with CPWL functions. This way, we provide important insights concerning the richness of the class $\operatorname{ReLU}(k)$. As a by-product, the results in Section 3 imply that $\operatorname{MAX}_n(n)$ is a strict subset of $\operatorname{CPWL}_n = \operatorname{MAX}_n(n+1)$, which was conjectured by Wang and Sun [74] in 2005, but has been open since then.

So far, we have focused on understanding the smallest depth needed to express CPWL functions using neural networks with ReLU activations. In Section 4, we complement these results by upper bounds on the sizes of the networks needed for expressing arbitrary CPWL functions. In particular, we show the following theorem.

Theorem 1.9.

Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be a CPWL function with $p$ affine pieces. Then $f$ can be represented by a ReLU NN with depth $\lceil \log_2(n+1) \rceil + 1$ and width $\mathcal{O}(p^{2n^2+3n+1})$.

We arrive at this result by introducing a novel application of recently established interrelations between neural networks and tropical geometry.

Theorem 1.9 improves upon a previous bound by He et al. [35] because it is polynomial in $p$ if $n$ is regarded as a fixed constant, while the bounds in [35] are exponential in $p$. In work subsequent to the first version of our article, it was shown that the width of the network can be drastically decreased if one allows more depth (on the order of $\log(p)$ instead of $\log(n)$) [16].
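To get a feeling for the bound in Theorem 1.9, the following small sketch tabulates the depth $\lceil \log_2(n+1) \rceil + 1$ and the exponent $2n^2+3n+1$ of $p$ in the width bound for a few input dimensions. The function names are hypothetical, and the constant hidden in the $\mathcal{O}$-notation is not modeled.

```python
def theorem_1_9_depth(n):
    """Depth ceil(log2(n + 1)) + 1, computed with exact integer arithmetic."""
    # For every integer n >= 0, n.bit_length() equals ceil(log2(n + 1)).
    return n.bit_length() + 1


def theorem_1_9_width_exponent(n):
    """Exponent e such that the width bound reads O(p^e)."""
    return 2 * n * n + 3 * n + 1


for n in (1, 2, 3, 4):
    print(n, theorem_1_9_depth(n), theorem_1_9_width_exponent(n))
```

For fixed $n$ the exponent is a constant, which is why the width bound is polynomial in $p$.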

Let us remark that there are different definitions of the number of pieces $p$ of a CPWL function $f$ in the literature; compare the discussions in [16, 35] about pieces versus linear components. Our bounds work with any of these definitions since they apply to the smallest possible way to define $p$, called linear components in [16]: for our purposes, $p$ can be defined as the smallest number of affine functions such that, at each point, $f$ is equal to one of these affine functions. Since all other definitions of the number of pieces are at least that large, our bounds are valid for these definitions as well.

Finally, in Section 5, we provide an outlook on how these interactions between tropical geometry and NNs could possibly also be useful to provide a full, unconditional proof of Conjecture 1.4 by means of polytope theory. This yields another equivalent rephrasing of Conjecture 1.4 which is stated purely in the language of basic operations on polytopes and does not involve neural networks any more.

We conclude in Section 6 with a discussion of further open research questions.

1.5 Further Related Work

Depth versus size

Soon after the original universal approximation theorems [18, 40], concrete bounds were obtained on the number of neurons needed in the hidden layer to achieve a certain level of accuracy. The literature on this is vast and we refer to a small representative sample here [8, 9, 51, 60, 52, 53]. More recent research has focused on how deeper networks can have exponentially or super exponentially smaller size compared to shallower networks [72, 6, 21, 32, 33, 46, 57, 61, 63, 69, 70, 75]. See also [29] for another perspective on the relationship between expressivity and architecture, and the references therein.

Mixed-integer optimization and machine learning

Over the past decade, a growing body of work has emerged that explores the interplay between mixed-integer optimization and machine learning. On the one hand, researchers have attempted to improve mixed-integer optimization algorithms by exploiting novel techniques from machine learning [13, 24, 34, 43, 44, 45, 47, 3]; see also [10] for a recent survey. On the flip side, mixed-integer optimization techniques have been used to analyze function classes represented by neural networks [67, 4, 22, 66, 65]. In Section 2 below, we show another new use of mixed-integer optimization tools for understanding function classes represented by neural networks.

Design of training algorithms

We believe that a better understanding of the function classes represented exactly by a neural architecture also has benefits in terms of understanding the complexity of the training problem. For instance, in work by Arora et al. [6], an understanding of single layer ReLU networks enables the design of a globally optimal algorithm for solving the empirical risk minimization (ERM) problem, that runs in polynomial time in the number of data points in fixed dimension. See also [25, 26, 27, 19, 14, 28, 23, 1, 12, 11, 17, 42] for similar lines of work.

Neural Networks and Tropical Geometry

A recent stream of research involves the interplay between neural networks and tropical geometry. The piecewise linear functions computed by neural networks can be seen as (tropical quotients of) tropical polynomials. Linear regions of these functions correspond to vertices of so-called Newton polytopes associated with these tropical polynomials. Applications of this correspondence include bounding the number of linear regions of a neural network [76, 15, 54] and understanding decision boundaries [2]. In Section 4 we present a novel application of tropical concepts to understand neural networks. We refer to [50] for a recent survey of connections between machine learning and tropical geometry, as well as to the textbooks by Maclagan and Sturmfels [49] and Joswig [41] for in-depth introductions to tropical geometry and tropical combinatorics.

2 Conditional Lower Depth Bounds via Mixed-Integer Programming

In this section, we provide a computer-aided proof that, under a natural, yet unproven assumption, the function $f(x) \coloneqq \max\{0,x_1,x_2,x_3,x_4\}$ cannot be represented by a 3-layer NN. It is worth noting that, to the best of our knowledge, no CPWL function is known for which the non-existence of a 3-layer NN can be proven without additional assumptions. For ease of notation, we write $x_0 \coloneqq 0$.

We first prove that we may restrict ourselves to NNs without biases. This holds true independent of our assumption, which we introduce afterwards.

Definition 2.1.

A function $g\colon \mathbb{R}^n \to \mathbb{R}^m$ is called positively homogeneous if it satisfies $g(\lambda x) = \lambda g(x)$ for all $\lambda \geq 0$.

Definition 2.2.

For an NN given by transformations $T^{(\ell)}(x) = A^{(\ell)} x + b^{(\ell)}$, we define the corresponding homogenized NN to be the NN given by $\tilde{T}^{(\ell)}(x) = A^{(\ell)} x$ with all biases set to zero.

Proposition 2.3.

If an NN computes a positively homogeneous function, then the corresponding homogenized NN computes the same function.

Proof.

Let $g\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_{k+1}}$ be the function computed by the original NN and $\tilde{g}$ the one computed by the homogenized NN. Further, for any $0 \leq \ell \leq k$, let

\[ g^{(\ell)} = T^{(\ell+1)} \circ \sigma \circ T^{(\ell)} \circ \dots \circ T^{(2)} \circ \sigma \circ T^{(1)} \]

be the function computed by the sub-NN consisting of the first $\ell+1$ layers and let $\tilde{g}^{(\ell)}$ be the function computed by the corresponding homogenized sub-NN. We first show by induction on $\ell$ that the norm $\lVert g^{(\ell)}(x) - \tilde{g}^{(\ell)}(x) \rVert$ is bounded by a global constant that only depends on the parameters of the NN but not on $x$.

For $\ell = 0$, we have $\lVert g^{(0)}(x) - \tilde{g}^{(0)}(x) \rVert = \lVert b^{(1)} \rVert \eqqcolon C_0$, settling the induction base. For the induction step, let $\ell \geq 1$ and assume that $\lVert g^{(\ell-1)}(x) - \tilde{g}^{(\ell-1)}(x) \rVert \leq C_{\ell-1}$, where $C_{\ell-1}$ only depends on the parameters of the NN. Since a component-wise application of the ReLU activation function has Lipschitz constant 1, this implies $\lVert (\sigma \circ g^{(\ell-1)})(x) - (\sigma \circ \tilde{g}^{(\ell-1)})(x) \rVert \leq C_{\ell-1}$. Using the spectral matrix norm $\lVert A \rVert$ of a matrix $A$, we obtain:

\begin{align*}
\lVert g^{(\ell)}(x) - \tilde{g}^{(\ell)}(x) \rVert
&= \lVert b^{(\ell+1)} + A^{(\ell+1)} ((\sigma \circ g^{(\ell-1)})(x) - (\sigma \circ \tilde{g}^{(\ell-1)})(x)) \rVert \\
&\leq \lVert b^{(\ell+1)} \rVert + \lVert A^{(\ell+1)} \rVert \cdot C_{\ell-1} \eqqcolon C_\ell
\end{align*}

Since the right-hand side only depends on NN parameters, the induction is completed.

Finally, we show that $g = \tilde{g}$. For the sake of contradiction, suppose that there is an $x \in \mathbb{R}^{n_0}$ with $\lVert g(x) - \tilde{g}(x) \rVert = \delta > 0$. Let $x' \coloneqq \frac{C_k + 1}{\delta} x$; then, by positive homogeneity of $g$ (by assumption) and $\tilde{g}$ (by construction and because the ReLU function is positively homogeneous), it follows that $\lVert g(x') - \tilde{g}(x') \rVert = C_k + 1 > C_k$, contradicting the property shown above. Thus, we have $g = \tilde{g}$. ∎
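As a small sanity check of Proposition 2.3, the following sketch evaluates a toy network with nonzero biases that nevertheless computes the positively homogeneous function $x \mapsto \max\{0, x\}$, and compares it with its homogenized version. The network `net` and all helper names are made up for this illustration and do not appear in the paper.

```python
def relu(v):
    return [max(0, t) for t in v]


def affine(A, b, x):
    """Compute A x + b for a matrix A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]


def forward(layers, x):
    """Evaluate T^(k+1) after ReLU after ... after T^(1), layers = [(A, b), ...]."""
    *hidden, (A_out, b_out) = layers
    for A, b in hidden:
        x = relu(affine(A, b, x))
    return affine(A_out, b_out, x)


def homogenize(layers):
    """Set all biases to zero, as in Definition 2.2."""
    return [(A, [0] * len(b)) for A, b in layers]


# Hidden neurons: relu(x), relu(1), relu(x + 2), relu(x + 2).
# Output: relu(x) + 3 * relu(1) + relu(x + 2) - relu(x + 2) - 3 = relu(x).
net = [
    ([[1], [0], [1], [1]], [0, 1, 2, 2]),
    ([[1, 3, 1, -1]], [-3]),
]

for x in (-2, 0, 3):
    print(x, forward(net, [x]), forward(homogenize(net), [x]))
```

In this toy example the bias contributions cancel in the output, so dropping all biases leaves the computed function unchanged, as the proposition predicts.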

Since $f = \max\{0,x_1,x_2,x_3,x_4\}$ is positively homogeneous, Proposition 2.3 implies that, if there is a 3-layer NN computing $f$, then there also is one that has no biases. Therefore, in the remainder of this section, we only consider NNs without biases and assume implicitly that all considered CPWL functions are positively homogeneous. In particular, any piece of such a CPWL function is linear and not only affine linear.

Observe that, for the function $f$, the only points of non-differentiability (a.k.a. breakpoints) are at places where at least two of the five numbers $x_0 = 0$, $x_1$, $x_2$, $x_3$, and $x_4$ are equal. Hence, if some neuron of an NN computing $f$ introduces breakpoints at other places, these breakpoints must be canceled out by other neurons. Therefore, we find it natural to work under the assumption that such breakpoints need not be introduced at all in the first place.

To make this assumption formal, let $H_{ij} = \{x \in \mathbb{R}^4 \mid x_i = x_j\}$, for $0 \leq i < j \leq 4$, be ten hyperplanes in $\mathbb{R}^4$ and $H = \bigcup_{0 \leq i < j \leq 4} H_{ij}$ be the corresponding hyperplane arrangement. This is the intersection of the so-called braid arrangement in five dimensions with the hyperplane $x_0 = 0$ [68]. The regions or cells of $H$ are defined to be the closures of the connected components of $\mathbb{R}^4 \setminus H$. It is easy to see that these regions are in one-to-one correspondence to the $5! = 120$ possible orderings of the five numbers $x_0 = 0$, $x_1$, $x_2$, $x_3$, and $x_4$. More precisely, for a permutation $\pi$ of the five indices $[4]_0 = \{0,1,2,3,4\}$, the corresponding region is the polyhedron

\[ C_\pi \coloneqq \{x \in \mathbb{R}^4 \mid x_{\pi(0)} \leq x_{\pi(1)} \leq x_{\pi(2)} \leq x_{\pi(3)} \leq x_{\pi(4)}\}. \]

Definition 2.4.

We say that a (positively homogeneous) CPWL function $g$ is $H$-conforming if it is linear within any of these regions of $H$, that is, if it only has breakpoints where the relative ordering of the five values $x_0 = 0$, $x_1$, $x_2$, $x_3$, $x_4$ changes. Moreover, an NN is said to be $H$-conforming if the output of each neuron contained in the NN is $H$-conforming.

See Figure 4 for an illustration of the definition in the (simpler) two-dimensional case. Note that, by the definition, an NN is $H$-conforming if and only if, for all layers $\ell \in [k]$, the intermediate function $\sigma \circ T^{(\ell)} \circ \sigma \circ T^{(\ell-1)} \circ \dots \circ \sigma \circ T^{(1)}$ is $H$-conforming.

Figure 4: A function is $H$-conforming if the set of breakpoints is a subset of the hyperplane arrangement $H$. The arrangement $H$ consists of all hyperplanes where two of the coordinates (possibly including $x_0 = 0$) are equal. Here, $H$ is illustrated for the (simpler) two-dimensional case, where it consists of three hyperplanes that divide the space into six cells, labeled in the figure by the orderings $x_1 \geq x_2 \geq 0$, $x_2 \geq x_1 \geq 0$, $x_1 \geq 0 \geq x_2$, $x_2 \geq 0 \geq x_1$, $0 \geq x_1 \geq x_2$, and $0 \geq x_2 \geq x_1$.
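The correspondence between cells and orderings in the four-dimensional setting can be illustrated by a minimal sketch. It enumerates the 120 permutations, constructs one interior point per cell, and checks that $f = \max\{0,x_1,x_2,x_3,x_4\}$ coincides on $C_\pi$ with the linear function $x \mapsto x_{\pi(4)}$ (with $x_0 = 0$), which is why $f$ itself is $H$-conforming. The helper names `cell_of` and `point_in_cell` are hypothetical.

```python
from itertools import permutations


def cell_of(x):
    """Return a permutation pi of (0, ..., 4) with x in C_pi, where x_0 = 0."""
    values = [0] + list(x)
    return tuple(sorted(range(5), key=lambda i: values[i]))


def point_in_cell(pi):
    """Interior point of C_pi: index pi[r] gets rank r, shifted so that x_0 = 0."""
    rank = {i: r for r, i in enumerate(pi)}
    return [rank[i] - rank[0] for i in range(1, 5)]


def f(x):
    return max(0, *x)


cells = list(permutations(range(5)))
print(len(cells))

for pi in cells:
    x = point_in_cell(pi)
    values = [0] + x
    assert cell_of(x) == pi          # the point lies in exactly this cell
    assert f(x) == values[pi[4]]     # on C_pi, f picks the coordinate x_{pi(4)}
```

Points on one of the hyperplanes $H_{ij}$ belong to several cells; for such points `cell_of` returns just one of the matching permutations.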

As argued above, it is plausible that considering $H$-conforming NNs is enough to prove Conjecture 1.4. In other words, we conjecture that, if there exists a 3-layer NN computing the function $f(x) = \max\{0,x_1,x_2,x_3,x_4\}$, then there also exists one that is $H$-conforming. This motivates the following theorem, for which we give a computer-aided proof by means of mixed-integer programming.

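Theorem 1.7 (restated). There does not exist an $H$-conforming 3-layer ReLU NN computing the function $\max\{0,x_1,x_2,x_3,x_4\}$.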

The remainder of this section is devoted to proving this theorem. The rough outline of the proof is as follows. We first study some geometric properties of the hyperplane arrangement $H$. This will show that each of the 120 cells of $H$ is a simplicial polyhedral cone spanned by 4 extreme rays. In total, there are 30 such rays (because rays are used multiple times to span different cones). This implies that each $H$-conforming function is uniquely determined by its values on the 30 rays and, therefore, the set of $H$-conforming functions of type $\mathbb{R}^4 \to \mathbb{R}$ is a 30-dimensional vector space. We then use linear algebra to show that the space of functions generated by $H$-conforming two-layer NNs is a 14-dimensional subspace. Moreover, with two hidden layers, at least 29 of the 30 dimensions can be generated and $f$ is not contained in this 29-dimensional subspace. So the remaining question is whether the 14 dimensions producible with the first hidden layer can be combined in such a way that, after applying a ReLU activation in the second hidden layer, we do not end up within the 29-dimensional subspace. We model this question as a mixed-integer program (MIP). Solving the MIP yields that we always end up within the 29-dimensional subspace, implying that $f$ cannot be represented by an $H$-conforming 3-layer NN. This provides a computational proof of Theorem 1.7.

Let us start with investigating the structure of the hyperplane arrangement $H$. For readers familiar with the interplay between hyperplane arrangements and polytopes, it is worth noting that $H$ is dual to a combinatorial equivalent of the 4-dimensional permutahedron. Hence, what we are studying in the following are some combinatorial properties of the permutahedron.

Recall that the regions of $H$ are given by the 120 polyhedra

\[ C_\pi \coloneqq \{x \in \mathbb{R}^4 \mid x_{\pi(0)} \leq x_{\pi(1)} \leq x_{\pi(2)} \leq x_{\pi(3)} \leq x_{\pi(4)}\} \subseteq \mathbb{R}^4 \]

for each permutation $\pi$ of $[4]_0$, where $x_0$ is used as a replacement for $0$. With this representation, one can see that $C_\pi$ is a pointed polyhedral cone (with the origin as its only vertex) spanned by the four half-lines (a.k.a. rays)

\begin{align*}
R_{\{\pi(0)\}} &\coloneqq \{x \in \mathbb{R}^4 \mid x_{\pi(0)} \leq x_{\pi(1)} = x_{\pi(2)} = x_{\pi(3)} = x_{\pi(4)}\}, \\
R_{\{\pi(0),\pi(1)\}} &\coloneqq \{x \in \mathbb{R}^4 \mid x_{\pi(0)} = x_{\pi(1)} \leq x_{\pi(2)} = x_{\pi(3)} = x_{\pi(4)}\}, \\
R_{\{\pi(0),\pi(1),\pi(2)\}} &\coloneqq \{x \in \mathbb{R}^4 \mid x_{\pi(0)} = x_{\pi(1)} = x_{\pi(2)} \leq x_{\pi(3)} = x_{\pi(4)}\}, \\
R_{\{\pi(0),\pi(1),\pi(2),\pi(3)\}} &\coloneqq \{x \in \mathbb{R}^4 \mid x_{\pi(0)} = x_{\pi(1)} = x_{\pi(2)} = x_{\pi(3)} \leq x_{\pi(4)}\}.
\end{align*}

Observe that these objects are indeed rays anchored at the origin because the three equalities define a one-dimensional subspace of $\mathbb{R}^4$ and the inequality cuts away one of the two directions.

With that notation, we see that each of the $120$ cells of $H$ is a simplicial cone spanned by four out of the $30$ rays $R_S$ with $\emptyset \subsetneq S \subsetneq [4]_0$. For each such set $S$, denote its complement by $\bar{S} \coloneqq [4]_0 \setminus S$. Let us use a generating vector $r_S \in \mathbb{R}^4$ for each of these rays such that $R_S = \operatorname{cone} r_S$ as follows: If $0 \in S$, then $r_S \coloneqq \mathds{1}_{\bar{S}} \in \mathbb{R}^4$, otherwise $r_S \coloneqq -\mathds{1}_S \in \mathbb{R}^4$, where for each $S \subseteq [4]$, the vector $\mathds{1}_S \in \mathbb{R}^4$ contains entries $1$ at precisely those index positions that are contained in $S$ and entries $0$ elsewhere.
For example, $r_{\{0,2,3\}} = (1,0,0,1) \in \mathbb{R}^4$ and $r_{\{1,4\}} = (-1,0,0,-1) \in \mathbb{R}^4$. Then, the set $R$ containing conic generators of all the $30$ rays of $H$ consists of the $30$ vectors $R = (\{0,1\}^4 \cup \{0,-1\}^4) \setminus \{0\}^4$.
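The definition of the generators can be checked mechanically. The following Python sketch is an illustration only and is not part of the original argument; the names `GROUND`, `proper_subsets`, `ray`, `RAYS`, and `CLOSED_FORM` are hypothetical. It enumerates all sets $S$ with $\emptyset \subsetneq S \subsetneq [4]_0$, builds $r_S$ as defined above, and compares the result with the closed form for $R$.

```python
from itertools import combinations, product

GROUND = (0, 1, 2, 3, 4)  # the index set [4]_0; index 0 stands for the constant 0


def proper_subsets():
    """All sets S with {} != S != [4]_0, as frozensets."""
    return [frozenset(c) for k in range(1, 5) for c in combinations(GROUND, k)]


def ray(S):
    """Generator r_S in R^4, listing the coordinates x_1, ..., x_4."""
    if 0 in S:
        # r_S is the indicator vector of the complement of S
        complement = set(GROUND) - set(S)
        return tuple(1 if i in complement else 0 for i in range(1, 5))
    # r_S is the negated indicator vector of S
    return tuple(-1 if i in S else 0 for i in range(1, 5))


RAYS = {S: ray(S) for S in proper_subsets()}

# Closed form from the text: ({0,1}^4 union {0,-1}^4) without the zero vector.
CLOSED_FORM = (
    set(product((0, 1), repeat=4)) | set(product((0, -1), repeat=4))
) - {(0, 0, 0, 0)}
```

Under these conventions, `RAYS` contains one generator per admissible set $S$, and generators of complementary sets are negatives of each other.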

Let $\mathcal{S}^{30}$ be the space of all $H$-conforming CPWL functions of type $\mathbb{R}^4 \to \mathbb{R}$. We show that $\mathcal{S}^{30}$ is a $30$-dimensional vector space.

Lemma 2.5.

The map $g \mapsto (g(r))_{r \in R}$ that evaluates a function $g \in \mathcal{S}^{30}$ at the $30$ rays in $R$ is an isomorphism between $\mathcal{S}^{30}$ and $\mathbb{R}^{30}$. In particular, $\mathcal{S}^{30}$ is a $30$-dimensional vector space.

Proof.

First note that $\mathcal{S}^{30}$ is closed under addition and scalar multiplication. Therefore, it is a subspace of the vector space of continuous functions of type $\mathbb{R}^4 \to \mathbb{R}$, and thus, in particular, a vector space. We show that the map $g \mapsto (g(r))_{r \in R}$ is in fact a vector space isomorphism. The map is obviously linear, so we only need to show that it is a bijection. In order to do so, remember that $\mathbb{R}^4$ is the union of the $5! = 120$ simplicial cones $C_\pi$. In particular, given the function values on the extreme rays of these cones, there is a unique positively homogeneous, continuous continuation that is linear within each of the $120$ cones. This implies that the considered map is a bijection between $\mathcal{S}^{30}$ and $\mathbb{R}^{30}$. ∎

The previous lemma also provides a canonical basis of the vector space $\mathcal{S}^{30}$: the one consisting of all CPWL functions attaining value $1$ at one ray $r \in R$ and value $0$ at all other rays. However, it turns out that for our purposes it is more convenient to work with a different basis. To this end, let $g_M(x) = \max_{i \in M} x_i$ for each $M \subseteq [4]_0$ with $M \notin \{\emptyset, \{0\}\}$. These $30$ functions contain, among other functions, the four (linear) coordinate projections $g_{\{i\}}(x) = x_i$, $i \in [4]$, and the function $f(x) = g_{[4]_0}(x) = \max\{0, x_1, x_2, x_3, x_4\}$.

Lemma 2.6.

The $30$ functions $g_M(x) = \max_{i \in M} x_i$ with $\{\emptyset, \{0\}\} \not\ni M \subseteq [4]_0$ form a basis of $\mathcal{S}^{30}$.

Proof.

Evaluating the $30$ functions $g_M$ at all $30$ rays $r \in R$ yields $30$ vectors in $\mathbb{R}^{30}$. It can be easily verified (e.g., using a computer) that these vectors form a basis of $\mathbb{R}^{30}$. Thus, due to the isomorphism of Lemma 2.5, the functions $g_M$ form a basis of $\mathcal{S}^{30}$. ∎
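One possible way to carry out the computer verification mentioned in the proof is sketched below. This is a minimal sketch and not the authors' code: the helper names (`subsets`, `ray`, `g`, `rank`) are hypothetical, and exact rational Gaussian elimination is merely one convenient choice that avoids floating-point issues. The sketch builds the $30 \times 30$ matrix of values $g_M(r_S)$ and computes its rank.

```python
from fractions import Fraction
from itertools import combinations

GROUND = (0, 1, 2, 3, 4)  # the index set [4]_0


def subsets(min_size, max_size):
    """Subsets of [4]_0 whose size lies in [min_size, max_size]."""
    return [frozenset(c) for k in range(min_size, max_size + 1)
            for c in combinations(GROUND, k)]


def ray(S):
    """Generator r_S in R^4 (coordinates x_1..x_4), as defined in the text."""
    if 0 in S:
        return tuple(0 if i in S else 1 for i in range(1, 5))
    return tuple(-1 if i in S else 0 for i in range(1, 5))


def g(M, x):
    """g_M(x) = max_{i in M} x_i, using the convention x_0 = 0."""
    full = (0,) + tuple(x)
    return max(full[i] for i in M)


def rank(rows):
    """Matrix rank via exact Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


RAY_SETS = subsets(1, 4)                                         # 30 rays
BASIS_SETS = [M for M in subsets(1, 5) if M != frozenset({0})]   # 30 functions
MATRIX = [[g(M, ray(S)) for S in RAY_SETS] for M in BASIS_SETS]
BASIS_RANK = rank(MATRIX)
```

If `BASIS_RANK` equals the number of rows, the evaluation vectors are linearly independent, which is the statement used in the proof.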

Next, we focus on particular subspaces of $\mathcal{S}^{30}$ generated by only some of the $30$ functions $g_M$. We prove that they correspond to the spaces of functions computable by $H$-conforming $2$- and $3$-layer NNs, respectively.

To do so, let $\mathcal{B}^{14}$ be the set of the $14$ basis functions $g_M$ with $\{\emptyset, \{0\}\} \not\ni M \subseteq [4]_0$ and $\lvert M \rvert \leq 2$. Let $\mathcal{S}^{14}$ be the $14$-dimensional subspace spanned by $\mathcal{B}^{14}$. Similarly, let $\mathcal{B}^{29}$ be the set of the $29$ basis functions $g_M$ with $\{\emptyset, \{0\}\} \not\ni M \subsetneq [4]_0$ (all but $[4]_0$). Let $\mathcal{S}^{29}$ be the $29$-dimensional subspace spanned by $\mathcal{B}^{29}$.

Lemma 2.7.

The space $\mathcal{S}^{14}$ consists of all functions computable by $H$-conforming $2$-layer NNs.

Proof.

Each function in $\mathcal{S}^{14}$ is a linear combination of $2$-term max functions by definition. Hence, by Lemma 1.2, it can be represented by a 2-layer NN.

Conversely, we show that any function representable by a 2-layer NN is indeed contained in $\mathcal{S}^{14}$. It suffices to show that the output of every neuron in the first (and only) hidden layer of an $H$-conforming ReLU NN is in $\mathcal{S}^{14}$ because the output of a 2-layer NN is a linear combination of such outputs. Let $a \in \mathbb{R}^4$ be the first-layer weights of such a neuron, computing the function $g_a(x) \coloneqq \max\{a^T x, 0\}$, which has the hyperplane $\{x \in \mathbb{R}^4 \mid a^T x = 0\}$ as breakpoints (or is constantly zero). Since the NN must be $H$-conforming, this must be one of the ten hyperplanes $x_i = x_j$, $0 \leq i < j \leq 4$. Thus, $g_a(x) = \max\{\lambda(x_i - x_j), 0\}$ for some $\lambda \in \mathbb{R}$.
If $\lambda \geq 0$, it follows that $g_a = \lambda g_{\{i,j\}} - \lambda g_{\{j\}} \in \mathcal{S}^{14}$, and if $\lambda \leq 0$, we obtain $g_a = -\lambda g_{\{i,j\}} + \lambda g_{\{i\}} \in \mathcal{S}^{14}$. This concludes the proof. ∎
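The two identities at the end of the proof can be sanity-checked numerically. The sketch below is illustrative only; the function names `neuron`, `decomposition`, and `identity_holds` are hypothetical, and the check on a finite integer grid is a spot check, not a proof. It uses the convention $x_0 = 0$, so that $g_{\{0\}}$ is the zero function.

```python
from itertools import product


def g(M, x):
    """g_M(x) = max_{i in M} x_i, using the convention x_0 = 0."""
    full = (0,) + tuple(x)
    return max(full[i] for i in M)


def neuron(lam, i, j, x):
    """Output max{lam * (x_i - x_j), 0} of a first-layer neuron."""
    full = (0,) + tuple(x)
    return max(lam * (full[i] - full[j]), 0)


def decomposition(lam, i, j, x):
    """Right-hand sides of the two identities from the proof of Lemma 2.7."""
    if lam >= 0:
        return lam * g({i, j}, x) - lam * g({j}, x)
    return -lam * g({i, j}, x) + lam * g({i}, x)


def identity_holds(lam, i, j, grid=(-2, -1, 0, 1, 2)):
    """Compare both sides on every point of a small integer grid in R^4."""
    return all(neuron(lam, i, j, x) == decomposition(lam, i, j, x)
               for x in product(grid, repeat=4))
```

For instance, `identity_holds(2, 1, 3)` compares $\max\{2(x_1 - x_3), 0\}$ with $2 g_{\{1,3\}} - 2 g_{\{3\}}$ on the grid.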

For 3-layer NNs, an analogous statement can be made. However, only one direction can be easily seen.

Lemma 2.8.

Any function in $\mathcal{S}^{29}$ can be represented by an $H$-conforming $3$-layer NN.

Proof.

As in the previous lemma, each function in $\mathcal{S}^{29}$ is a linear combination of $4$-term max functions by definition. Hence, by Lemma 1.2, it can be represented by a 3-layer NN. ∎

Our goal is to prove the converse as well: any $H$-conforming function represented by a 3-layer NN is in $\mathcal{S}^{29}$. Since $f(x) = \max\{0, x_1, x_2, x_3, x_4\}$ is the 30th basis function, which is linearly independent from $\mathcal{B}^{29}$ and thus not contained in $\mathcal{S}^{29}$, this implies Theorem 1.7. To achieve this goal, we first provide another characterization of $\mathcal{S}^{29}$, which can be seen as an orthogonal direction to $\mathcal{S}^{29}$ in $\mathcal{S}^{30}$. For a function $g \in \mathcal{S}^{30}$, let

$$\phi(g) \coloneqq \sum_{\emptyset \subsetneq S \subsetneq [4]_0} (-1)^{\lvert S \rvert} g(r_S)$$

be a linear map from $\mathcal{S}^{30}$ to $\mathbb{R}$.

Lemma 2.9.

A function $g \in \mathcal{S}^{30}$ is contained in $\mathcal{S}^{29}$ if and only if $\phi(g) = 0$.

Proof.

Any $g \in \mathcal{S}^{30}$ can be represented as a unique linear combination of the $30$ basis functions $g_M$ and is contained in $\mathcal{S}^{29}$ if and only if the coefficient of $f = g_{[4]_0}$ is zero. One can easily check (with a computer) that $\phi$ maps all functions in $\mathcal{B}^{29}$ to $0$, but not the 30th basis function $f$. Thus, $g$ is contained in $\mathcal{S}^{29}$ if and only if it satisfies $\phi(g) = 0$. ∎
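The computer check referred to in this proof might look as follows. This is a sketch under the same conventions as the definitions above ($x_0 = 0$, rays $r_S$ indexed by $\emptyset \subsetneq S \subsetneq [4]_0$); the names `subsets`, `ray`, `g`, `phi`, and `PHI_VALUES` are hypothetical and not taken from the paper.

```python
from itertools import combinations

GROUND = (0, 1, 2, 3, 4)  # the index set [4]_0


def subsets(min_size, max_size):
    """Subsets of [4]_0 whose size lies in [min_size, max_size]."""
    return [frozenset(c) for k in range(min_size, max_size + 1)
            for c in combinations(GROUND, k)]


def ray(S):
    """Generator r_S in R^4 (coordinates x_1..x_4), as defined in the text."""
    if 0 in S:
        return tuple(0 if i in S else 1 for i in range(1, 5))
    return tuple(-1 if i in S else 0 for i in range(1, 5))


def g(M, x):
    """g_M(x) = max_{i in M} x_i, using the convention x_0 = 0."""
    full = (0,) + tuple(x)
    return max(full[i] for i in M)


def phi(func):
    """phi(g) = sum over proper nonempty S of (-1)^|S| * g(r_S)."""
    return sum((-1) ** len(S) * func(ray(S)) for S in subsets(1, 4))


BASIS_SETS = [M for M in subsets(1, 5) if M != frozenset({0})]
# phi evaluated on each of the 30 basis functions g_M
PHI_VALUES = {M: phi(lambda x, M=M: g(M, x)) for M in BASIS_SETS}
```

Inspecting `PHI_VALUES` shows which basis functions lie in the kernel of $\phi$; by linearity, this determines $\phi$ on all of $\mathcal{S}^{30}$.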

In order to make use of our assumption that the NN is $H$-conforming, we need the following insight about when the property of being $H$-conforming is preserved after applying a ReLU activation.

Lemma 2.10.

Let $g \in \mathcal{S}^{30}$. The function $h = \sigma \circ g$ is $H$-conforming (and thus in $\mathcal{S}^{30}$ as well) if and only if there is no pair of sets $\emptyset \subsetneq S \subsetneq S' \subsetneq [4]_0$ with $g(r_S)$ and $g(r_{S'})$ being nonzero and having different signs.

Proof.

The key observation to prove this lemma is the following: for two rays $r_S$ and $r_{S'}$, there exists a cell $C$ of the hyperplane arrangement $H$ for which both $r_S$ and $r_{S'}$ are extreme rays if and only if $S \subsetneq S'$ or $S' \subsetneq S$.

Hence, if there exists a pair of sets $\emptyset \subsetneq S \subsetneq S' \subsetneq [4]_0$ with $g(r_S)$ and $g(r_{S'})$ being nonzero and having different signs, then, for a cell $C$ having both $r_S$ and $r_{S'}$ as extreme rays, the function $g$ restricted to $C$ is a linear function with both strictly positive and strictly negative values. Therefore, after applying the ReLU activation, the resulting function $h$ has breakpoints within $C$ and is not $H$-conforming.

Conversely, if for each pair of sets $\emptyset \subsetneq S \subsetneq S' \subsetneq [4]_0$, both $g(r_S)$ and $g(r_{S'})$ are either nonpositive or nonnegative, then $g$ restricted to any cell $C$ of $H$ is either nonpositive or nonnegative everywhere. In the first case, $h$ restricted to that cell $C$ is the zero function, while in the second case, $h$ coincides with $g$ in $C$. In both cases, $h$ is linear within all cells and, thus, $H$-conforming. ∎
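The criterion of Lemma 2.10 only involves the values of $g$ at the $30$ rays, so it is straightforward to test. The following sketch is an illustration with hypothetical names (`relu_keeps_conformity` and the helpers); it assumes that the function passed in is already $H$-conforming, which the code does not verify.

```python
from itertools import combinations

GROUND = (0, 1, 2, 3, 4)  # the index set [4]_0


def proper_subsets():
    """All sets S with {} != S != [4]_0, as frozensets."""
    return [frozenset(c) for k in range(1, 5) for c in combinations(GROUND, k)]


def ray(S):
    """Generator r_S in R^4 (coordinates x_1..x_4), as defined in the text."""
    if 0 in S:
        return tuple(0 if i in S else 1 for i in range(1, 5))
    return tuple(-1 if i in S else 0 for i in range(1, 5))


def relu_keeps_conformity(func):
    """True iff no nested pair S < S' has values of strictly opposite sign.

    `func` maps a point (x_1, ..., x_4) to a number and is assumed to be
    H-conforming; only its values at the rays r_S are inspected.
    """
    values = {S: func(ray(S)) for S in proper_subsets()}
    for S in values:
        for T in values:
            # S < T is the proper-subset test for frozensets
            if S < T and values[S] * values[T] < 0:
                return False
    return True
```

For example, the linear function $x_1 - x_2$ passes the test, in line with $\max\{0, x_1 - x_2\}$ having its breakpoints on the hyperplane $x_1 = x_2$, whereas $x_1 + x_2$ fails because of the nested pair $\{1\} \subsetneq \{0,1\}$.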

Having collected all these lemmas, we are finally able to construct an MIP whose solution proves that any function computed by an $H$-conforming 3-layer NN is in $\mathcal{S}^{29}$. As in the proof of Lemma 2.7, it suffices to focus on the output of a single neuron in the second hidden layer. Let $h = \sigma \circ g$ be the output of such a neuron with $g$ being its input. Observe that, by construction, $g$ is a function computed by a $2$-layer NN, and thus, by Lemma 2.7, a linear combination of the $14$ functions in $\mathcal{B}^{14}$. The MIP contains three types of variables, which we denote in bold to distinguish them from constants:

  • $14$ continuous variables $\mathbf{a}_M \in [-1,1]$, being the coefficients of the linear combination of the basis of $\mathcal{S}^{14}$ forming $g$, that is, $g = \sum_{g_M \in \mathcal{B}^{14}} \mathbf{a}_M g_M$ (since multiplying $g$, and hence $h$, with a positive scalar does not alter the containment of $h$ in $\mathcal{S}^{29}$, we may restrict the variables to $[-1,1]$),

  • $30$ binary variables $\mathbf{z}_S \in \{0,1\}$ for $\emptyset \subsetneq S \subsetneq [4]_0$, determining whether the considered neuron is strictly active at ray $r_S$, that is, whether $g(r_S) > 0$,

  • $30$ continuous variables $\mathbf{y}_S \in \mathbb{R}$ for $\emptyset \subsetneq S \subsetneq [4]_0$, representing the output of the considered neuron at all rays, that is, $\mathbf{y}_S = h(r_S)$.

To ensure that these variables interact as expected, we need two types of constraints:

  • For each of the $30$ rays $r_S$, $\emptyset \subsetneq S \subsetneq [4]_0$, the following constraints ensure that $\mathbf{z}_S$ and output $\mathbf{y}_S$ are correctly calculated from the variables $\mathbf{a}_M$, that is, $\mathbf{z}_S = 1$ if and only if $g(r_S) = \sum_{g_M \in \mathcal{B}^{14}} \mathbf{a}_M g_M(r_S)$ is positive, and $\mathbf{y}_S = \max\{0, g(r_S)\}$. Also compare the references given in Section 1.5 concerning MIP models for ReLU units. Note that the restriction of the coefficients $\mathbf{a}_M$ to $[-1,1]$ ensures that the absolute value of $g(r_S)$ is always bounded by $14$, allowing us to use $15$ as a replacement for $+\infty$:

    $$\begin{aligned}
    \mathbf{y}_S &\geq 0 \\
    \mathbf{y}_S &\geq \sum_{g_M \in \mathcal{B}^{14}} \mathbf{a}_M g_M(r_S) \\
    \mathbf{y}_S &\leq 15 \mathbf{z}_S \\
    \mathbf{y}_S &\leq \sum_{g_M \in \mathcal{B}^{14}} \mathbf{a}_M g_M(r_S) + 15(1 - \mathbf{z}_S)
    \end{aligned} \tag{2}$$

    Observe that these constraints ensure that one of the following two cases occurs: If $\mathbf{z}_S = 0$, then the first and third line imply $\mathbf{y}_S = 0$ and the second line implies that the incoming activation is in fact nonpositive. The fourth line is always satisfied in that case. Otherwise, if $\mathbf{z}_S = 1$, then the second and fourth line imply that $\mathbf{y}_S$ equals the incoming activation, and, in combination with the first line, this has to be nonnegative. The third line is always satisfied in that case. Hence, the set of constraints (2) correctly models the ReLU activation function.

  • For each of the $150$ pairs of sets $\emptyset \subsetneq S \subsetneq S' \subsetneq [4]_0$, the following constraints ensure that the property in Lemma 2.10 is satisfied. More precisely, if one of the variables $\mathbf{z}_S$ or $\mathbf{z}_{S'}$ equals $1$, then the ray of the other set has nonnegative activation, that is, $g(r_{S'}) \geq 0$ or $g(r_S) \geq 0$, respectively:

    $$\begin{aligned}
    \sum_{g_M \in \mathcal{B}^{14}} \mathbf{a}_M g_M(r_S) &\geq 15(\mathbf{z}_{S'} - 1) \\
    \sum_{g_M \in \mathcal{B}^{14}} \mathbf{a}_M g_M(r_{S'}) &\geq 15(\mathbf{z}_S - 1)
    \end{aligned} \tag{3}$$

    Observe that these constraints prevent the two rays $r_S$ and $r_{S'}$ from having nonzero activations with different signs. Conversely, if no such pair of rays exists, then we can always satisfy constraints (3) by setting only those variables $\mathbf{z}_S$ to value $1$ where the activation of ray $r_S$ is strictly positive. (Note that, if the incoming activation is precisely zero, constraints (2) make it possible to choose either value $0$ or $1$ for $\mathbf{z}_S$.) Hence, these constraints are in fact appropriate to model $H$-conformity.
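The case analysis for constraints (2) can be replayed for a single ray with a few lines of code. The sketch below is illustrative and not part of the original model: the names `BIG_M` and `feasible_outputs` are hypothetical, and the incoming activation $g(r_S)$ is treated as a fixed number $t$ with $\lvert t \rvert \leq 14$ rather than as an expression in the variables $\mathbf{a}_M$.

```python
BIG_M = 15  # the constant used in the text in place of +infinity


def feasible_outputs(t):
    """Feasible choices under constraints (2) for a fixed activation t.

    Returns a list of triples (z, lower, upper): for the binary value z,
    the output y may take any value in [lower, upper]. Values of z that
    make the four inequalities infeasible are omitted.
    """
    result = []
    for z in (0, 1):
        lower = max(0, t)                            # y >= 0 and y >= t
        upper = min(BIG_M * z, t + BIG_M * (1 - z))  # y <= 15 z and y <= t + 15 (1 - z)
        if lower <= upper:
            result.append((z, lower, upper))
    return result
```

For any $t$ in the admissible range, every feasible interval collapses to the single point $\max\{0, t\}$, and both values of $z$ are feasible only for $t = 0$, matching the remark in parentheses above.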

In the light of Lemma 2.9, the objective function of our MIP is to maximize $\phi(h)$, that is, the expression

\[\sum_{\emptyset\subsetneq S\subsetneq[4]_0}(-1)^{\lvert S\rvert}\mathbf{y}_S.\]

The MIP has a total of 30 binary and 44 continuous variables, as well as 420 inequality constraints. The next proposition formalizes how this MIP can be used to check whether a 3-layer NN function can exist outside $\mathcal{S}^{29}$.
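As a quick sanity check on these counts, the following sketch enumerates the index sets $S$ of the objective. It assumes, consistently with the text, that $[4]_0=\{0,1,2,3,4\}$ and that there is one binary variable $\mathbf{z}_S$ and one continuous variable $\mathbf{y}_S$ per ray $r_S$, plus one coefficient $\mathbf{a}_M$ for each of the 14 functions in $\mathcal{B}^{14}$. The helper names are hypothetical and are not taken from the authors' SageMath script.

```python
from itertools import combinations

GROUND_SET = (0, 1, 2, 3, 4)  # assumed meaning of [4]_0


def proper_nonempty_subsets(ground):
    """Return all S with {} != S != ground, as frozensets."""
    return [frozenset(c)
            for size in range(1, len(ground))
            for c in combinations(ground, size)]


def objective_coefficient(subset):
    """Coefficient (-1)^{|S|} of y_S in the objective."""
    return (-1) ** len(subset)


subsets = proper_nonempty_subsets(GROUND_SET)
num_binary = len(subsets)            # assumption: one z_S per ray r_S
num_continuous = len(subsets) + 14   # assumption: one y_S per ray plus 14 coefficients a_M

# S and its complement have sizes of different parity, so their coefficients are opposite.
full = frozenset(GROUND_SET)
complement_flips_sign = all(
    objective_coefficient(s) == -objective_coefficient(full - s) for s in subsets
)
```

Under these assumptions the counts agree with the 30 binary and 44 continuous variables stated above; the sketch does not attempt to reproduce the 420 constraints.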

Proposition 2.11.

There exists an $H$-conforming 3-layer NN computing a function not contained in $\mathcal{S}^{29}$ if and only if the objective value of the MIP defined above is strictly positive.

Proof.

For the first direction, assume that such an NN exists. Since its final output is a linear combination of the outputs of the neurons in the second hidden layer, one of these neurons must compute a function $\tilde{h}=\sigma\circ\tilde{g}\notin\mathcal{S}^{29}$, with $\tilde{g}$ being the input to that neuron. By Lemma 2.9, it follows that $\phi(\tilde{h})\neq 0$. Moreover, we can even assume without loss of generality that $\phi(\tilde{h})>0$, as we argue now. If this is not the case, multiply all first-layer weights of the NN by $-1$ to obtain a new NN computing function $\hat{h}$ instead of $\tilde{h}$. Observing that $r_S=-r_{[4]_0\setminus S}$ for all $r_S\in R$, we obtain $\hat{h}(r_S)=\tilde{h}(-r_S)=\tilde{h}(r_{[4]_0\setminus S})$ for all $r_S\in R$.
Plugging this into the definition of $\phi$ and using that the cardinalities of $S$ and $[4]_0\setminus S$ have different parity, we further obtain $\phi(\hat{h})=-\phi(\tilde{h})$. Therefore, we can assume that $\phi(\tilde{h})$ was already positive in the first place.
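Written out, the sign flip is a reindexing of the sum by $S'=[4]_0\setminus S$. The following display is a sketch that assumes $\phi(h)=\sum_{\emptyset\subsetneq S\subsetneq[4]_0}(-1)^{\lvert S\rvert}h(r_S)$, which is the form of the MIP objective with $\mathbf{y}_S=h(r_S)$, and that $[4]_0$ has five elements:

```latex
\begin{align*}
\phi(\hat{h})
  &= \sum_{\emptyset\subsetneq S\subsetneq[4]_0} (-1)^{\lvert S\rvert}\,\hat{h}(r_S)
   = \sum_{\emptyset\subsetneq S\subsetneq[4]_0} (-1)^{\lvert S\rvert}\,\tilde{h}(r_{[4]_0\setminus S}) \\
  &= \sum_{\emptyset\subsetneq S'\subsetneq[4]_0} (-1)^{5-\lvert S'\rvert}\,\tilde{h}(r_{S'})
   = -\sum_{\emptyset\subsetneq S'\subsetneq[4]_0} (-1)^{\lvert S'\rvert}\,\tilde{h}(r_{S'})
   = -\phi(\tilde{h}).
\end{align*}
```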

Using Lemma 2.7, the function $\tilde{g}$ can be represented as a linear combination $\tilde{g}=\sum_{g_M\in\mathcal{B}^{14}}\tilde{\mathbf{a}}_M g_M$ of the functions in $\mathcal{B}^{14}$. Let $\alpha\coloneqq\max_M\lvert\tilde{\mathbf{a}}_M\rvert$. Note that $\alpha>0$ because otherwise $\tilde{g}$ would be the zero function. Let us define modified functions $g$ and $h$ from $\tilde{g}$ and $\tilde{h}$ as follows. Let $\mathbf{a}_M\coloneqq\tilde{\mathbf{a}}_M/\alpha\in[-1,1]$, $g\coloneqq\sum_{g_M\in\mathcal{B}^{14}}\mathbf{a}_M g_M$, and $h\coloneqq\sigma\circ g$.
Moreover, for all rays $r_S\in R$, let $\mathbf{y}_S\coloneqq h(r_S)$, as well as $\mathbf{z}_S\coloneqq 1$ if $\mathbf{y}_S>0$, and $\mathbf{z}_S\coloneqq 0$ otherwise.

It is easy to verify that the variables $\mathbf{a}_M$, $\mathbf{y}_S$, and $\mathbf{z}_S$ defined that way satisfy (2). Moreover, since the NN is $H$-conforming, they also satisfy (3). Finally, they also yield a strictly positive objective function value since $\phi(h)=\phi(\tilde{h})/\alpha>0$.

For the reverse direction, assume that there exists an MIP solution consisting of $\mathbf{a}_M$, $\mathbf{y}_S$, and $\mathbf{z}_S$, satisfying (2) and (3), and having a strictly positive objective function value. Define the functions $g\coloneqq\sum_{g_M\in\mathcal{B}^{14}}\mathbf{a}_M g_M$ and $h\coloneqq\sigma\circ g$. One concludes from (2) that $h(r_S)=\mathbf{y}_S$ for all rays $r_S\in R$. Lemma 2.7 implies that $g$ can be represented by a 2-layer NN. Thus, $h$ can be represented by a 3-layer NN. Moreover, constraints (3) guarantee that this NN is $H$-conforming. Finally, since the MIP solution has strictly positive objective function value, we obtain $\phi(h)>0$, implying that $h\notin\mathcal{S}^{29}$. ∎

In order to use the MIP as part of a mathematical proof, we employed an MIP solver that uses exact rational arithmetic, free of numerical errors, namely the solver of the Parma Polyhedra Library (PPL) [7]. We called the solver from a SageMath (Version 9.0) [71] script on a machine with an Intel Core i7-8700 6-Core 64-bit CPU and 15.5 GB RAM, using the openSUSE Leap 15.2 Linux distribution. SageMath, which natively includes the PPL solver, is published under the GPLv3 license. After a total running time of almost 7 days (153 hours), we obtained an optimal objective function value of zero. This makes it possible to prove Theorem 1.7.

Proof of Theorem 1.7.

Since the MIP has optimal objective function value zero, Proposition 2.11 implies that any function computed by an $H$-conforming $3$-layer NN is contained in $\mathcal{S}^{29}$. In particular, it is not possible to compute the function $f(x)=\max\{0,x_1,x_2,x_3,x_4\}$ with an $H$-conforming $3$-layer NN. ∎

We remark that the state-of-the-art MIP solver Gurobi (version 9.1.1) [30], which is commercial but offers free academic licenses, is able to solve the same MIP in less than a second, providing the same result. However, Gurobi does not employ exact arithmetic, which makes it impossible to exclude numerical errors and to use its output as a mathematical proof.
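The difference between exact and floating-point arithmetic can be illustrated with a toy computation that is unrelated to the actual MIP. The sketch below uses Python's standard `fractions` module as a stand-in for exact rational arithmetic; it is not the PPL solver and makes no claim about Gurobi's internals.

```python
from fractions import Fraction

# Floating point: 0.1 and 0.2 have no exact binary representation,
# so their sum carries a rounding error.
float_sum = 0.1 + 0.2
float_is_exact = (float_sum == 0.3)

# Exact rationals: the same identity holds without any rounding.
exact_sum = Fraction(1, 10) + Fraction(2, 10)
exact_is_exact = (exact_sum == Fraction(3, 10))
```

A claim such as "the optimal value is exactly zero" can only be certified in the second kind of arithmetic.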

The SageMath code can be found on GitHub at

https://github.com/ChristophHertrich/relu-mip-depth-bound.

Additionally, the MIP can be found there as an .mps file, a standard format for representing MIPs. This allows one to use any solver of choice to reproduce our result.

3 Going Beyond Linear Combinations of Max Functions

In this section we prove the following result, showing that NNs with $k$ hidden layers can compute more functions than only linear combinations of $2^k$-term max functions.

See Theorem 1.8.

In order to prove this theorem, for each number of hidden layers $k\geq 2$, we provide a specific function in $\operatorname{ReLU}(k)\setminus\operatorname{MAX}(2^k)$. The challenging part is to show that the function is in fact not contained in $\operatorname{MAX}(2^k)$.

Proposition 3.1.

For any $n\geq 3$, the function $f\colon\mathbb{R}^n\to\mathbb{R}$ defined by

\[f(x)=\max\{0,x_1,x_2,\dots,x_{n-3},\,\max\{x_{n-2},x_{n-1}\}+\max\{0,x_n\}\}\tag{4}\]

is not contained in $\operatorname{MAX}(n)$.

This means that $f$ cannot be written as a linear combination of $n$-term max functions. This proves the conjecture by Wang and Sun [74] that $\operatorname{MAX}_n(n)\subsetneq\operatorname{CPWL}_n$, which had been open since 2005. Previously, it was only known that linear combinations of $(n-1)$-term maxes are not sufficient to represent every CPWL function defined on $\mathbb{R}^n$, that is, $\operatorname{MAX}_n(n-1)\subsetneq\operatorname{CPWL}_n$. Lu [48] provides a short analytical argument for this fact.

Before we prove Proposition 3.1, we show that it implies Theorem 1.8.

Proof of Theorem 1.8.

For $k\geq 2$, let $n\coloneqq 2^k$. By Proposition 3.1, the function $f$ defined in (4) is not contained in $\operatorname{MAX}(2^k)$. It remains to show that it can be represented using a ReLU NN with $k$ hidden layers. To see this, first observe that any of the $n/2=2^{k-1}$ terms $\max\{0,x_1\}$, $\max\{x_{2i},x_{2i+1}\}$ for $i\in[n/2-2]$, and $\max\{x_{n-2},x_{n-1}\}+\max\{0,x_n\}$ can be expressed by a one-hidden-layer NN since all these are (linear combinations of) $2$-term max functions. Since $f$ is the maximum of these $2^{k-1}$ terms, and since the maximum of $2^{k-1}$ numbers can be computed with $k-1$ hidden layers (Lemma 1.2), this implies that $f$ is in $\operatorname{ReLU}(k)$. ∎
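The construction in this proof can be checked numerically. The sketch below is an illustration only: it evaluates $f$ once directly from (4) and once through the $2^{k-1}$ first-level terms followed by a tournament of pairwise maxima, using the identity $\max\{a,b\}=b+\max\{0,a-b\}$. The function names are hypothetical, and the code counts tournament rounds rather than building an actual network.

```python
import random


def relu(t):
    return max(0, t)


def max2(a, b):
    """max{a, b} with one ReLU; in a network, b itself is passed on as relu(b) - relu(-b)."""
    return b + relu(a - b)


def f_direct(x):
    """Definition (4); x has length n and x[i - 1] plays the role of x_i."""
    n = len(x)
    last = max(x[n - 3], x[n - 2]) + max(0, x[n - 1])
    return max([0] + x[: n - 3] + [last])


def f_layered(x):
    """Evaluate f from n/2 first-level terms and count the pairwise-max rounds."""
    n = len(x)  # assumed to be a power of two with n >= 4
    terms = [relu(x[0])]                                     # max{0, x_1}
    terms += [max2(x[2 * i - 1], x[2 * i])                   # max{x_{2i}, x_{2i+1}}
              for i in range(1, n // 2 - 1)]
    terms.append(max2(x[n - 3], x[n - 2]) + relu(x[n - 1]))  # last term of (4)
    rounds = 0
    while len(terms) > 1:                                    # each round halves the term count
        terms = [max2(terms[j], terms[j + 1]) for j in range(0, len(terms), 2)]
        rounds += 1
    return terms[0], rounds


rng = random.Random(0)
sample = [rng.randint(-5, 5) for _ in range(8)]  # n = 8, so k = 3
value, rounds = f_layered(sample)
```

For $n=2^k$ the tournament takes $k-1$ rounds, which together with the single hidden layer for the first-level terms matches the $k$ hidden layers claimed above.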

In order to prove Proposition 3.1, we need the concept of polyhedral complexes. A polyhedral complex $\mathcal{P}$ is a finite set of polyhedra such that each face of a polyhedron in $\mathcal{P}$ is also in $\mathcal{P}$, and for two polyhedra $P,Q\in\mathcal{P}$, their intersection $P\cap Q$ is a common face of $P$ and $Q$ (possibly the empty face). Given a polyhedral complex $\mathcal{P}$ in $\mathbb{R}^n$ and an integer $m\in[n]$, we let $\mathcal{P}^m$ denote the collection of all $m$-dimensional polyhedra in $\mathcal{P}$.

For a convex CPWL function $f$, we define its underlying polyhedral complex as follows: it is the unique polyhedral complex $\mathcal{P}$ covering $\mathbb{R}^n$ (i.e., each point in $\mathbb{R}^n$ belongs to some polyhedron in $\mathcal{P}$) whose $n$-dimensional polyhedra coincide with the domains of the (maximal) affine pieces of $f$. In particular, $f$ is affine linear within each $P\in\mathcal{P}$, but not within any strict superset of a polyhedron in $\mathcal{P}^n$.

Exploiting properties of polyhedral complexes associated with CPWL functions, we prove the following proposition.

Proposition 3.2.

Let $f_0\colon\mathbb{R}^n\to\mathbb{R}$ be a convex CPWL function and let $\mathcal{P}_0$ be the underlying polyhedral complex. If there exists a hyperplane $H\subseteq\mathbb{R}^n$ such that the set

\[T\coloneqq\bigcup\left\{F\in\mathcal{P}_0^{n-1}\;\middle|\;F\subseteq H\right\}\]

is nonempty and contains no line, then $f_0$ cannot be expressed as a linear combination of $n$-term maxima of affine linear functions.

Again, before we proceed to the proof of Proposition 3.2, we show that it implies Proposition 3.1.

Proof of Proposition 3.1.

Observe that $f$ (defined in (4)) has the alternate representation

\[f(x)=\max\{0,\,x_1,\,x_2,\,\dots,\,x_{n-3},\,x_{n-2},\,x_{n-1},\,x_{n-2}+x_n,\,x_{n-1}+x_n\}\]

as a maximum of $n+2$ terms. Let $\mathcal{P}$ be its underlying polyhedral complex. Let the hyperplane $H$ be defined by $x_1=0$.
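The alternate representation rests on distributing the sum over the two inner maxima, that is, $\max\{a,b\}+\max\{0,c\}=\max\{a,\,b,\,a+c,\,b+c\}$ with $a=x_{n-2}$, $b=x_{n-1}$, $c=x_n$. The following sketch checks this identity on a small integer grid; it is an illustration rather than a proof, and the function names are hypothetical.

```python
from itertools import product


def nested_term(a, b, c):
    """Last term of (4): max{a, b} + max{0, c}."""
    return max(a, b) + max(0, c)


def expanded_term(a, b, c):
    """The same quantity written as a single maximum of four affine terms."""
    return max(a, b, a + c, b + c)


grid = range(-3, 4)
identity_holds = all(
    nested_term(a, b, c) == expanded_term(a, b, c)
    for a, b, c in product(grid, repeat=3)
)
```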

Observe that any facet in $\mathcal{P}^{n-1}$ is a polyhedron defined by two of the $n+2$ terms that are equal and at least as large as each of the remaining $n$ terms. Hence, the only facet that could possibly be contained in $H$ is

\[F\coloneqq\{x\in\mathbb{R}^n\mid x_1=0\geq x_2,\,\dots,\,x_{n-3},\,x_{n-2},\,x_{n-1},\,x_{n-2}+x_n,\,x_{n-1}+x_n\}.\]

Note that $F$ is indeed an $(n-1)$-dimensional facet in $\mathcal{P}^{n-1}$, because, for example, a small ball around $(0,-1,\dots,-1)\in\mathbb{R}^n$ intersected with $H$ is contained in $F$.

Finally, we need to show that $F$ is pointed, that is, it contains no line. A well-known fact from polyhedral theory says that if there is any line in $F$ with direction $d\in\mathbb{R}^n\setminus\{0\}$, then $d$ must satisfy the defining inequalities with equality. However, only the zero vector does this. Hence, $F$ cannot contain a line.
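To spell out why only the zero vector qualifies, one can write down the homogeneous system that a line direction $d$ would have to satisfy. The display below is a sketch based on the description of $F$ given above:

```latex
\begin{align*}
d_1 &= 0, \\
d_i &= 0 \quad \text{for } i = 2,\dots,n-1, \\
d_{n-2} + d_n &= 0, \qquad d_{n-1} + d_n = 0.
\end{align*}
```

The first two rows give $d_1=\dots=d_{n-1}=0$, and since $n\geq 3$, the last row then forces $d_n=0$, so $d=0$.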

Therefore, when applying Proposition 3.2 to $f$ with underlying polyhedral complex $\mathcal{P}$ and hyperplane $H$, we have $T=F$, which is nonempty and contains no line. Hence, $f$ cannot be written as a linear combination of $n$-term maxima. ∎

The remainder of this section is devoted to proving Proposition 3.2. In order to exploit properties of the underlying polyhedral complex of the considered CPWL functions, we will first introduce some terminology, notation, and results related to polyhedral complexes in $\mathbb{R}^n$ for any $n\geq 1$.

Definition 3.3.

Given an abelian group $(G,+)$, we define $\mathcal{F}^n(G)$ as the family of all functions $\phi$ of the form $\phi\colon\mathcal{P}^n\to G$, where $\mathcal{P}$ is a polyhedral complex that covers $\mathbb{R}^n$. We say that $\mathcal{P}$ is the underlying polyhedral complex, or the polyhedral complex associated with $\phi$.

Just to give an intuition of the reason for this definition, let us mention that later we will choose $(G,+)$ to be the set of affine linear maps $\mathbb{R}^n\to\mathbb{R}$ with respect to the standard operation of sum of functions. Moreover, given a convex CPWL function $f\colon\mathbb{R}^n\to\mathbb{R}$ with underlying polyhedral complex $\mathcal{P}$, we will consider the following function $\phi\in\mathcal{F}^n(G)$: for every $P\in\mathcal{P}^n$, $\phi(P)$ will be the affine linear map that coincides with $f$ over $P$. It can be helpful, though not necessary, to keep this in mind when reading the next definitions and observations.

It is useful to observe that the functions in $\mathcal{F}^n(G)$ can also be described in a different way. Before explaining this, we need to define an ordering between the two elements of each pair of opposite halfspaces. More precisely, let $H$ be a hyperplane in $\mathbb{R}^n$ and let $H',H''$ be the two closed halfspaces delimited by $H$. We choose an arbitrary rule to say that $H'$ “precedes” $H''$, which we write as $H'\prec H''$. (In case one wants to see such a rule explicitly, this is a possible way: Fix an arbitrary $\bar{x}\in H$.
We can say that $H'\prec H''$ if and only if $\bar{x}+e_i\in H'$, where $e_i$ is the first vector in the standard basis of $\mathbb{R}^n$ that does not lie on $H$ (i.e., $e_1,\dots,e_{i-1}\in H$ and $e_i\notin H$). Note that this definition does not depend on the choice of $\bar{x}$.) We can then extend this ordering rule to those pairs of $n$-dimensional polyhedra of a polyhedral complex in $\mathbb{R}^n$ that share a facet.
Specifically, given a polyhedral complex $\mathcal{P}$ in $\mathbb{R}^n$, let $P',P''\in\mathcal{P}^n$ be such that $F\coloneqq P'\cap P''\in\mathcal{P}^{n-1}$. Further, let $H$ be the unique hyperplane containing $F$. We say that $P'\prec P''$ if the halfspace delimited by $H$ and containing $P'$ precedes the halfspace delimited by $H$ and containing $P''$.
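The explicit ordering rule described above can be sketched in a few lines. The sketch assumes that the hyperplane is given as $H=\{x\mid\langle a,x\rangle=b\}$ by a nonzero normal vector $a$, and it reads "$e_i$ does not lie on $H$" as "$e_i$ is not parallel to $H$", that is, $a_i\neq 0$. Both the representation and the function names are assumptions made for illustration.

```python
def first_transversal_index(normal):
    """0-based index i of the first standard basis vector e_i not parallel to H."""
    for i, coeff in enumerate(normal):
        if coeff != 0:
            return i
    raise ValueError("the normal vector of a hyperplane must be nonzero")


def preceding_halfspace(normal):
    """Return '>=' if H' = {x : <normal, x> >= b} is the preceding halfspace, else '<='.

    For x_bar in H we have <normal, x_bar + e_i> = b + normal[i], so only the sign
    of normal[i] matters; neither b nor the choice of x_bar enters the rule.
    """
    i = first_transversal_index(normal)
    return ">=" if normal[i] > 0 else "<="


# The same hyperplane described by a and by -a yields the same preceding halfspace:
# {<a, x> <= b} and {<-a, x> >= -b} are identical sets.
side_a = preceding_halfspace((0, -3, 5))      # '<=' with respect to a
side_minus_a = preceding_halfspace((0, 3, -5))  # '>=' with respect to -a
```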

We can now explain the alternate description of the functions in n(G)superscript𝑛𝐺\mathcal{F}^{n}(G)caligraphic_F start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_G ), which is based on the following notion.

Definition 3.4.

Let $\phi\in\mathcal{F}^n(G)$, with associated polyhedral complex $\mathcal{P}$. The facet-function associated with $\phi$ is the function $\psi\colon\mathcal{P}^{n-1}\to G$ defined as follows: given $F\in\mathcal{P}^{n-1}$, let $P',P''$ be the two polyhedra in $\mathcal{P}^n$ such that $F=P'\cap P''$, where $P'\prec P''$; then we set $\psi(F)\coloneqq\phi(P')-\phi(P'')$.
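As a small worked example, take $G$ to be the group of affine linear maps $\mathbb{R}^n\to\mathbb{R}$ and $f(x)=\max\{0,x_1\}$, with $\phi(P)$ the affine map that coincides with $f$ on $P$. The display is a sketch that assumes the explicit ordering rule described earlier as the choice of $\prec$, under which the halfspace $\{x_1\geq 0\}$ precedes $\{x_1\leq 0\}$:

```latex
\begin{align*}
\mathcal{P}^n &= \{P', P''\}, & P' &= \{x \in \mathbb{R}^n \mid x_1 \ge 0\}, & P'' &= \{x \in \mathbb{R}^n \mid x_1 \le 0\}, \\
\mathcal{P}^{n-1} &= \{F\}, & F &= P' \cap P'' = \{x \in \mathbb{R}^n \mid x_1 = 0\}, \\
\phi(P') &= (x \mapsto x_1), & \phi(P'') &= (x \mapsto 0), & \psi(F) &= \phi(P') - \phi(P'') = (x \mapsto x_1).
\end{align*}
```

With the opposite choice of ordering, $\psi(F)$ would simply change sign.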

Although it will not be used, we observe that knowing $\psi$ is sufficient to reconstruct $\phi$ up to an additive constant. This means that a function $\phi'\in\mathcal{F}^n(G)$ associated with the same polyhedral complex $\mathcal{P}$ has the same facet-function $\psi$ if and only if there exists $g\in G$ such that $\phi(P)-\phi'(P)=g$ for every $P\in\mathcal{P}^n$. (However, it is not true that every function $\psi\colon\mathcal{P}^{n-1}\to G$ is the facet-function of some function in $\mathcal{F}^n(G)$.)

We now introduce a sum operation over $\mathcal{F}^n(G)$.

Definition 3.5.

For functions $\phi_1,\dots,\phi_p\in\mathcal{F}^n(G)$ with associated polyhedral complexes $\mathcal{P}_1,\dots,\mathcal{P}_p$, the sum $\phi\coloneqq\phi_1+\dots+\phi_p$ is the function in $\mathcal{F}^n(G)$ defined as follows:

  • the polyhedral complex associated with $\phi$ is

    \[\mathcal{P}\coloneqq\{P_1\cap\dots\cap P_p\mid P_i\in\mathcal{P}_i\text{ for every }i\};\]

  • given $P\in\mathcal{P}^n$, $P$ can be uniquely obtained as $P_1\cap\dots\cap P_p$, where $P_i\in\mathcal{P}_i^n$ for every $i$; we then define

    \[\phi(P)=\sum_{i=1}^p\phi_i(P_i).\]

The term “sum” is justified by the fact that when $\mathcal{P}_1 = \dots = \mathcal{P}_p$ (and thus $\phi_1,\dots,\phi_p$ have the same domain) we obtain the standard notion of the sum of functions.

The next result shows how to compute the facet-function of a sum of functions in $\mathcal{F}^n(G)$.

Observation 3.6.

With the notation of Definition 3.5, let $\psi_1,\dots,\psi_p$ be the facet-functions associated with $\phi_1,\dots,\phi_p$, and let $\psi$ be the facet-function associated with $\phi$. Given $F \in \mathcal{P}^{n-1}$, let $I$ be the set of indices $i \in \{1,\dots,p\}$ such that $\mathcal{P}_i^{n-1}$ contains a (unique) element $F_i$ with $F \subseteq F_i$. Then

\[ \psi(F) = \sum_{i \in I} \psi_i(F_i). \tag{5} \]
Proof.

Let $P', P''$ be the two polyhedra in $\mathcal{P}^n$ such that $F = P' \cap P''$, with $P' \prec P''$. We have $P' = P'_1 \cap \dots \cap P'_p$ and $P'' = P''_1 \cap \dots \cap P''_p$ for a unique choice of $P'_i, P''_i \in \mathcal{P}_i^n$ for every $i$. Then

\[ \psi(F) = \phi(P') - \phi(P'') = \sum_{i=1}^{p} \bigl(\phi_i(P'_i) - \phi_i(P''_i)\bigr). \tag{6} \]

Now fix $i \in [p]$. Since $F \subseteq P'_i \cap P''_i$, $\dim(P'_i \cap P''_i) \geq n-1$. If $\dim(P'_i \cap P''_i) = n-1$, then $F_i \coloneqq P'_i \cap P''_i \in \mathcal{P}_i^{n-1}$ and $\phi_i(P'_i) - \phi_i(P''_i) = \psi_i(F_i)$. Furthermore, $i \in I$ because $F \subseteq F_i$. If, on the contrary, $\dim(P'_i \cap P''_i) = n$, the fact that $\mathcal{P}_i$ is a polyhedral complex implies that $P'_i = P''_i$, and thus $\phi_i(P'_i) - \phi_i(P''_i) = 0$. Moreover, in this case $i \notin I$: this is because $P' \cup P'' \subseteq P'_i$, which implies that the relative interior of $F$ is contained in the relative interior of $P'_i$. With these observations, from (6) we obtain (5). ∎
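
As a sanity check on (5), the following sketch works through the smallest nontrivial case: $n = 1$ and two functions whose complexes have different breakpoints. It is an illustration rather than part of the argument. The values $a', a'', b', b''$ are arbitrary elements of $G$, and we assume, purely for this example, that $\prec$ orders intervals of $\mathbb{R}$ from left to right.

```latex
% Illustrative sketch (n = 1). Assumption: \prec orders intervals from left to right.
% Maximal cells of \mathcal{P}_1: (-\infty,0] and [0,\infty).
% Maximal cells of \mathcal{P}_2: (-\infty,1] and [1,\infty).
\begin{align*}
  \phi_1\bigl((-\infty,0]\bigr) &= a', & \phi_1\bigl([0,\infty)\bigr) &= a'', \\
  \phi_2\bigl((-\infty,1]\bigr) &= b', & \phi_2\bigl([1,\infty)\bigr) &= b''.
\end{align*}
% Maximal cells of the sum complex \mathcal{P}: (-\infty,0], [0,1], [1,\infty), with
\begin{align*}
  \phi\bigl((-\infty,0]\bigr) &= a' + b', &
  \phi\bigl([0,1]\bigr) &= a'' + b', &
  \phi\bigl([1,\infty)\bigr) &= a'' + b''.
\end{align*}
% Facet F = \{0\}: only \mathcal{P}_1^{0} contains it, so I = \{1\}.
% Facet F = \{1\}: only \mathcal{P}_2^{0} contains it, so I = \{2\}.
\begin{align*}
  \psi(\{0\}) &= (a' + b') - (a'' + b') = a' - a'' = \psi_1(\{0\}), \\
  \psi(\{1\}) &= (a'' + b') - (a'' + b'') = b' - b'' = \psi_2(\{1\}).
\end{align*}
```

In both cases the contribution of the function whose complex has no breakpoint at $F$ cancels, which is exactly the case $i \notin I$ of the proof.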

Definition 3.7.

Fix $\phi \in \mathcal{F}^n(G)$, with associated polyhedral complex $\mathcal{P}$. Let $H$ be a hyperplane in $\mathbb{R}^n$, and let $H', H''$ be the closed halfspaces delimited by $H$. Define the polyhedral complex

\[ \widehat{\mathcal{P}} = \{P \cap H \mid P \in \mathcal{P}\} \cup \{P \cap H' \mid P \in \mathcal{P}\} \cup \{P \cap H'' \mid P \in \mathcal{P}\}. \]

The refinement of $\phi$ with respect to $H$ is the function $\widehat{\phi} \in \mathcal{F}^n(G)$ with associated polyhedral complex $\widehat{\mathcal{P}}$ defined as follows: given $\widehat{P} \in \widehat{\mathcal{P}}^n$, $\widehat{\phi}(\widehat{P}) \coloneqq \phi(P)$, where $P$ is the unique polyhedron in $\mathcal{P}$ that contains $\widehat{P}$.

The next result shows how to compute the facet-function of a refinement.

Observation 3.8.

With the notation of Definition 3.7, let $\psi$ be the facet-function associated with $\phi$. Then, the facet-function $\widehat{\psi}$ associated with $\widehat{\phi}$ is given by

\[ \widehat{\psi}(\widehat{F}) = \begin{cases} \psi(F) & \text{if there exists a (unique) $F \in \mathcal{P}^{n-1}$ containing $\widehat{F}$} \\ 0 & \text{otherwise,} \end{cases} \]

for every $\widehat{F} \in \widehat{\mathcal{P}}^{n-1}$.

Proof.

Let $\widehat{P}', \widehat{P}''$ be the polyhedra in $\widehat{\mathcal{P}}^n$ such that $\widehat{F} = \widehat{P}' \cap \widehat{P}''$, with $\widehat{P}' \prec \widehat{P}''$. Further, let $P', P''$ be the unique polyhedra in $\mathcal{P}^n$ that contain $\widehat{P}', \widehat{P}''$ (respectively). It might happen that $P' = P''$.

If there is $F \in \mathcal{P}^{n-1}$ containing $\widehat{F}$, then the fact that $\mathcal{P}$ is a polyhedral complex implies that $F = P' \cap P''$. Note that $P' \neq P''$ and $P' \prec P''$ in this case. Thus $\widehat{\psi}(\widehat{F}) = \widehat{\phi}(\widehat{P}') - \widehat{\phi}(\widehat{P}'') = \phi(P') - \phi(P'') = \psi(F)$.

Assume now that no element of $\mathcal{P}^{n-1}$ contains $\widehat{F}$. Then there exists $P \in \mathcal{P}^n$ such that $\widehat{F} = P \cap H$ and $H$ intersects the interior of $P$. Note that $P = P' = P''$ in this case. Then $\widehat{P}' = P \cap H'$ and $\widehat{P}'' = P \cap H''$ (or vice versa). It follows that $\widehat{\psi}(\widehat{F}) = \widehat{\phi}(\widehat{P}') - \widehat{\phi}(\widehat{P}'') = \phi(P) - \phi(P) = 0$. ∎
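
The two cases of Observation 3.8 can both be seen in a one-dimensional sketch. The example below is our own illustration: $a', a''$ are arbitrary elements of $G$, the hyperplane is the point $H = \{1\}$, and we assume, only for this example, that $\prec$ orders intervals of $\mathbb{R}$ from left to right.

```latex
% Illustrative sketch (n = 1). Assumption: \prec orders intervals from left to right.
% \mathcal{P} has maximal cells (-\infty,0] and [0,\infty);
% H = \{1\}, H' = (-\infty,1], H'' = [1,\infty).
\begin{align*}
  \phi\bigl((-\infty,0]\bigr) &= a', & \phi\bigl([0,\infty)\bigr) &= a''.
\end{align*}
% Maximal cells of \widehat{\mathcal{P}}: (-\infty,0], [0,1], [1,\infty);
% each inherits the value of the cell of \mathcal{P} that contains it.
\begin{align*}
  \widehat{\phi}\bigl((-\infty,0]\bigr) &= a', &
  \widehat{\phi}\bigl([0,1]\bigr) &= a'', &
  \widehat{\phi}\bigl([1,\infty)\bigr) &= a''.
\end{align*}
% \{0\} is an element of \mathcal{P}^{0}; \{1\} is contained in no element of \mathcal{P}^{0}.
\begin{align*}
  \widehat{\psi}(\{0\}) &= a' - a'' = \psi(\{0\}), &
  \widehat{\psi}(\{1\}) &= a'' - a'' = 0.
\end{align*}
```

The new breakpoint created by $H$ carries a zero jump, while the old breakpoint keeps its value.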

We now prove that the operations of sum and refinement commute: the refinement of a sum is the sum of the refinements.

Observation 3.9.

Let $\phi_1,\dots,\phi_p \in \mathcal{F}^n(G)$ be $p$ functions with associated polyhedral complexes $\mathcal{P}_1,\dots,\mathcal{P}_p$. Define $\phi \coloneqq \phi_1 + \dots + \phi_p$. Let $H$ be a hyperplane in $\mathbb{R}^n$, and let $H', H''$ be the closed halfspaces delimited by $H$. Then $\widehat{\phi} = \widehat{\phi}_1 + \dots + \widehat{\phi}_p$, where $\widehat{\phi}$ and $\widehat{\phi}_i$ denote the refinements of $\phi$ and $\phi_i$ with respect to $H$.

Proof.

Define $\widetilde{\phi} \coloneqq \widehat{\phi}_1 + \dots + \widehat{\phi}_p$. It can be verified that $\widehat{\phi}$ and $\widetilde{\phi}$ are defined on the same polyhedral complex, which we denote by $\widehat{\mathcal{P}}$. We now fix $\widehat{P} \in \widehat{\mathcal{P}}^n$ and show that $\widehat{\phi}(\widehat{P}) = \widetilde{\phi}(\widehat{P})$.

Since $\widehat{P} \in \widehat{\mathcal{P}}^n$, it is $n$-dimensional and contained in either $H'$ or $H''$. Since both cases are symmetric, let us focus on $\widehat{P} \subseteq H'$. This means that we can write it as $\widehat{P} = P_1 \cap \dots \cap P_p \cap H'$, where $P_i \in \mathcal{P}_i^n$ for every $i$. Then

\[ \widehat{\phi}(\widehat{P}) = \phi(P_1 \cap \dots \cap P_p) = \sum_{i=1}^{p} \phi_i(P_i) = \sum_{i=1}^{p} \widehat{\phi}_i(P_i \cap H') = \widetilde{\phi}(P_1 \cap \dots \cap P_p \cap H') = \widetilde{\phi}(\widehat{P}), \]

where the first and third equations follow from the definition of refinement, while the second and fourth equations follow from the definition of the sum. ∎

The lineality space of a (nonempty) polyhedron $P = \{x \in \mathbb{R}^n \mid Ax \leq b\}$ is the null space of the constraint matrix $A$. In other words, it is the set of vectors $y \in \mathbb{R}^n$ such that for every $x \in P$ the whole line $\{x + \lambda y \mid \lambda \in \mathbb{R}\}$ is a subset of $P$. We say that the lineality space of $P$ is trivial if it contains only the zero vector, and nontrivial otherwise.

Given a polyhedron $P$, it is well-known that all nonempty faces of $P$ share the same lineality space. Therefore, given a polyhedral complex $\mathcal{P}$ that covers $\mathbb{R}^n$, all the nonempty polyhedra in $\mathcal{P}$ share the same lineality space $L$. We will call $L$ the lineality space of $\mathcal{P}$.
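
For orientation, the following sketch contrasts a polyhedron with nontrivial lineality space and one with trivial lineality space in $\mathbb{R}^2$. It is an illustrative example only; the polyhedra $P$ and $Q$ and their constraint matrices are chosen for this purpose and do not appear elsewhere in the text.

```latex
% Illustrative sketch in R^2; the names P and Q are chosen for this example only.
\begin{align*}
  P &= \{x \in \mathbb{R}^2 \mid x_1 \leq 0\}, &
  A &= \begin{pmatrix} 1 & 0 \end{pmatrix}, &
  \ker A &= \{(0,t) \mid t \in \mathbb{R}\} && \text{(nontrivial)}, \\
  Q &= \{x \in \mathbb{R}^2 \mid x_1 \leq 0,\ x_2 \leq 0\}, &
  A &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, &
  \ker A &= \{0\} && \text{(trivial)}.
\end{align*}
% The complex whose nonempty elements are \{x_1 \le 0\}, \{x_1 \ge 0\} and the line \{x_1 = 0\}
% covers R^2, and all three share the lineality space \{(0,t) \mid t \in R\}.
```

A complex containing the quadrant $Q$ would instead have trivial lineality space, since $Q$ has a vertex.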

Lemma 3.10.

Given an abelian group $(G,+)$, pick $\phi_1,\dots,\phi_p \in \mathcal{F}^n(G)$, with associated polyhedral complexes $\mathcal{P}_1,\dots,\mathcal{P}_p$. Assume that for every $i \in [p]$ the lineality space of $\mathcal{P}_i$ is nontrivial. Define $\phi \coloneqq \phi_1 + \dots + \phi_p$, $\mathcal{P}$ as the underlying polyhedral complex, and $\psi$ as the facet-function of $\phi$. Then for every hyperplane $H \subseteq \mathbb{R}^n$, the set

\[ S \coloneqq \bigcup \left\{F \in \mathcal{P}^{n-1} \mid F \subseteq H,\ \psi(F) \neq 0\right\} \]

is either empty or contains a line.
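
Before the proof, a small instance may help to see what the lemma asserts. The sketch below is our own illustration with $n = 2$ and $p = 2$: $\phi_1$ lives on a complex split by the line $x_1 = 0$ and $\phi_2$ on a complex split by the line $x_2 = 0$, so both lineality spaces are nontrivial. The names $L_1, L_2, R_+, R_-$ are hypothetical labels used only here.

```latex
% Illustrative sketch (n = 2, p = 2); L_1, L_2, R_+, R_- are names chosen for this example only.
% \mathcal{P}_1: halfplanes \{x_1 \le 0\}, \{x_1 \ge 0\} and the line L_1 = \{x_1 = 0\}
%                (lineality space spanned by e_2).
% \mathcal{P}_2: halfplanes \{x_2 \le 0\}, \{x_2 \ge 0\} and the line L_2 = \{x_2 = 0\}
%                (lineality space spanned by e_1).
% The sum complex \mathcal{P} consists of the four closed quadrants and their faces.
% Take H = L_1. The facets of \mathcal{P} contained in H are the two rays
\[
  R_+ = \{x \in \mathbb{R}^2 \mid x_1 = 0,\ x_2 \geq 0\}, \qquad
  R_- = \{x \in \mathbb{R}^2 \mid x_1 = 0,\ x_2 \leq 0\}.
\]
% For F = R_\pm, only \mathcal{P}_1^{1} = \{L_1\} has an element containing F, so I = \{1\} in (5):
\[
  \psi(R_+) = \psi(R_-) = \psi_1(L_1).
\]
% Hence S = \emptyset if \psi_1(L_1) = 0, and S = R_+ \cup R_- = L_1 otherwise.
```

The case $H = L_2$ is symmetric, and any other hyperplane contains no facet of $\mathcal{P}$, so $S$ is empty there. In every case $S$ is empty or a full line, never a single ray.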

Proof.

The proof is by induction on $n$. For $n = 1$, the assumptions imply that all $\mathcal{P}_i$ are equal to $\mathcal{P}$, and each of these polyhedral complexes has $\mathbb{R}$ as its only nonempty face. Since $\mathcal{P}^{n-1}$ is empty, no hyperplane $H$ such that $S \neq \emptyset$ can exist.

Now fix $n \geq 2$. Assume by contradiction that there exists a hyperplane $H$ such that $S$ is nonempty and contains no line. Let $\widehat{\phi}$ be the refinement of $\phi$ with respect to $H$, $\widehat{\mathcal{P}}$ be the underlying polyhedral complex, and $\widehat{\psi}$ be the associated facet-function. Further, we define $\mathcal{Q} \coloneqq \{P \cap H \mid P \in \widehat{\mathcal{P}}\}$, which is a polyhedral complex that covers $H$. Note that if $H$ is identified with $\mathbb{R}^{n-1}$ then we can think of $\mathcal{Q}$ as a polyhedral complex that covers $\mathbb{R}^{n-1}$, and the restriction of $\widehat{\psi}$ to $\mathcal{Q}^{n-1}$, which we denote by $\phi'$, can be seen as a function in $\mathcal{F}^{n-1}(G)$. We will prove that $\phi'$ does not satisfy the lemma, contradicting the inductive hypothesis.

Since $\phi = \phi_1 + \dots + \phi_p$, by Observation 3.9 we have $\widehat{\phi} = \widehat{\phi}_1 + \dots + \widehat{\phi}_p$. Note that for every $i \in [p]$ the hyperplane $H$ is covered by the elements of $\widehat{\mathcal{P}}_i^{n-1}$. This implies that for every $\widehat{F} \in \widehat{\mathcal{P}}^{n-1}$ and $i \in [p]$ there exists $\widehat{F}_i \in \widehat{\mathcal{P}}_i^{n-1}$ such that $\widehat{F} \subseteq \widehat{F}_i$. Then, by Observation 3.6, $\widehat{\psi}(\widehat{F}) = \widehat{\psi}_1(\widehat{F}_1) + \dots + \widehat{\psi}_p(\widehat{F}_p)$.

Now, additionally suppose that $\widehat{F}$ is contained in $H$, that is, $\widehat{F} \in \mathcal{Q}^{n-1}$. Let $i \in [p]$ be such that the lineality space of $\mathcal{P}_i$ is not a subset of the linear space parallel to $H$. Then no element of $\mathcal{P}_i^{n-1}$ contains $\widehat{F}_i$. By Observation 3.8, $\widehat{\psi}_i(\widehat{F}_i) = 0$. We then conclude that

\[ \widehat{\psi}(\widehat{F}) = \sum_{i \in J} \widehat{\psi}_i(\widehat{F}_i) \quad \text{for every $\widehat{F} \in \mathcal{Q}^{n-1}$,} \]

where J𝐽Jitalic_J is the set of indices i𝑖iitalic_i such that the lineality space of 𝒫isubscript𝒫𝑖\mathcal{P}_{i}caligraphic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a subset of the linear space parallel to H𝐻Hitalic_H. This means that

ϕ=iJϕi,superscriptitalic-ϕsubscript𝑖𝐽subscriptsuperscriptitalic-ϕ𝑖\phi^{\prime}=\sum_{i\in J}\phi^{\prime}_{i},italic_ϕ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ italic_J end_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ,

where ϕisubscriptsuperscriptitalic-ϕ𝑖\phi^{\prime}_{i}italic_ϕ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the restriction of ψ^isubscript^𝜓𝑖\widehat{\psi}_{i}over^ start_ARG italic_ψ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to 𝒬in1superscriptsubscript𝒬𝑖𝑛1\mathcal{Q}_{i}^{n-1}caligraphic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT, with 𝒬i{PHP𝒫i^}subscript𝒬𝑖conditional-set𝑃𝐻𝑃^subscript𝒫𝑖\mathcal{Q}_{i}\coloneqq\{P\cap H\mid P\in\widehat{\mathcal{P}_{i}}\}caligraphic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≔ { italic_P ∩ italic_H ∣ italic_P ∈ over^ start_ARG caligraphic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG }. Note that for every iJ𝑖𝐽i\in Jitalic_i ∈ italic_J the lineality space of 𝒬isubscript𝒬𝑖\mathcal{Q}_{i}caligraphic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is clearly nontrivial, as it coincides with the lineality space of 𝒫isubscript𝒫𝑖\mathcal{P}_{i}caligraphic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

Now pick any $\widehat{F}\in\mathcal{Q}^{n-1}$. Note that if there exists $F\in\mathcal{P}^{n-1}$ such that $\widehat{F}\subseteq F$, then $\widehat{F}=F$. It then follows from 3.8 that

\[ \bigcup\left\{\widehat{F}\in\mathcal{Q}^{n-1}\;\middle|\;\widehat{\psi}(\widehat{F})\neq 0\right\}=S. \]

In other words,

\[ \bigcup\left\{F\in\mathcal{Q}^{n-1}\;\middle|\;\phi'(F)\neq 0\right\}=S. \tag{7} \]

Since $S\neq H$ (as $S$ contains no line), there exists a polyhedron $F\in\mathcal{Q}^{n-1}$ such that $F\subseteq S$ and $F$ has a facet $F_0$ which does not belong to any other polyhedron in $\mathcal{Q}^{n-1}$ contained in $S$. Then the facet-function $\psi'$ associated with $\phi'$ satisfies $\psi'(F_0)\neq 0$. Let $H'$ be the $(n-2)$-dimensional affine space containing $F_0$. Then the set

\[ S'\coloneqq\bigcup\left\{F\in\mathcal{Q}^{n-2}\;\middle|\;F\subseteq H',\ \psi'(F)\neq 0\right\} \]

is nonempty, as $F_0\subseteq S'$. Furthermore, we claim that $S'$ contains no line. To see why this is true, take any $F\in\mathcal{Q}^{n-2}$ such that $F\subseteq H'$ and $\psi'(F)\neq 0$, and let $F',F''$ be the two polyhedra in $\mathcal{Q}^{n-1}$ having $F$ as a facet. Then $\phi'(F')\neq\phi'(F'')$, and thus at least one of these values (say $\phi'(F')$) is nonzero. Then, by (7), $F'\subseteq S$, and thus also $F\subseteq S$. This shows that $S'\subseteq S$ and therefore $S'$ contains no line.

We have shown that $\phi'$ does not satisfy the lemma. This contradicts the inductive assumption that the lemma holds in dimension $n-1$. ∎

Finally, we can use this lemma to prove Proposition 3.2.

Proof of Proposition 3.2.

Assume for the sake of a contradiction that

\[ f_0(x)=\sum_{i=1}^{p}\lambda_i\max\{\ell_{i1}(x),\dots,\ell_{in}(x)\}\quad\text{for every $x\in\mathbb{R}^n$}, \]

where $p\in\mathbb{N}$, $\lambda_1,\dots,\lambda_p\in\mathbb{R}$ and $\ell_{ij}\colon\mathbb{R}^n\to\mathbb{R}$ is an affine linear function for every $i\in[p]$ and $j\in[n]$. Define $f_i(x)\coloneqq\lambda_i\max\{\ell_{i1}(x),\dots,\ell_{in}(x)\}$ for every $i\in[p]$, which is a CPWL function.

Fix any $i\in[p]$ such that $\lambda_i\geq 0$. Then $f_i$ is convex. Note that its epigraph

\[ E_i\coloneqq\{(x,z)\in\mathbb{R}^n\times\mathbb{R}\mid z\geq\ell_{ij}(x)\text{ for $j\in[n]$}\} \]

is a polyhedron in $\mathbb{R}^{n+1}$ defined by $n$ inequalities, and thus has a nontrivial lineality space. Furthermore, no line orthogonal to the $x$-space is contained in $E_i$. Since the underlying polyhedral complex $\mathcal{P}_i$ of $f_i$ consists of the orthogonal projections of the faces of $E_i$ (excluding $E_i$ itself) onto the $x$-space, this implies that $\mathcal{P}_i$ also has a nontrivial lineality space. (More precisely, the lineality space of $\mathcal{P}_i$ is the projection of the lineality space of $E_i$.)

If $\lambda_i<0$, then $f_i$ is concave. By arguing as above on the convex function $-f_i$, one obtains that the underlying polyhedral complex $\mathcal{P}_i$ again has a nontrivial lineality space. Thus this property holds for every $i\in[p]$.
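The lineality argument can be checked on a small instance. The following is a minimal sketch with hand-picked, hypothetical data for $n=2$ (the coefficients `A`, `B` and the direction `d` are assumptions of this illustration, not taken from the paper): two affine functions in two variables always admit a nonzero direction `d` on which their linear parts agree, so the maximum shifts by a constant along `d` and its breakpoint set is invariant under translation by `d`.

```python
from fractions import Fraction

# Hypothetical example data for n = 2: affine functions l_j(x) = a_j . x + b_j.
A = [(1, 2), (3, -1)]
B = [1, 0]

def f(x):
    """Maximum of the n = 2 affine functions at the point x in R^2."""
    return max(a[0] * x[0] + a[1] * x[1] + b for a, b in zip(A, B))

# Direction d with a_1 . d == a_2 . d; the vector (d, delta) then lies in the
# lineality space of the epigraph {(x, z) : z >= l_j(x) for all j}.
d = (3, 2)
delta = A[0][0] * d[0] + A[0][1] * d[1]

def shifted(x, t):
    """Translate the point x by t * d."""
    return (x[0] + t * d[0], x[1] + t * d[1])

def check_invariance(points, steps):
    """Check f(x + t d) == f(x) + t * delta on a finite sample (exact arithmetic)."""
    return all(f(shifted(x, t)) == f(x) + t * delta for x in points for t in steps)

points = [(Fraction(i, 2), Fraction(j, 3)) for i in range(-4, 5) for j in range(-4, 5)]
steps = [Fraction(k, 4) for k in range(-8, 9)]
print(delta, check_invariance(points, steps))
```

Since `f` changes by the affine amount `t * delta` along `d`, its nonlinearities, and hence the cells of its polyhedral complex, are unchanged by translation along `d`, which is what a nontrivial lineality space means here.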

The set of affine linear functions $\mathbb{R}^n\to\mathbb{R}$ forms an abelian group (with respect to the standard operation of sum of functions), which we denote by $(G,+)$. For every $i\in[p]_0$, let $\phi_i$ be the function in $\mathcal{F}^n(G)$ with underlying polyhedral complex $\mathcal{P}_i$ defined as follows: for every $P\in\mathcal{P}_i^n$, $\phi_i(P)$ is the affine linear function that coincides with $f_i$ over $P$. Define $\phi\coloneqq\phi_1+\dots+\phi_p$ and let $\mathcal{P}$ be the underlying polyhedral complex.

Note that for every $P\in\mathcal{P}^n$, $\phi(P)$ is precisely the affine linear function that coincides with $f_0$ within $P$. However, $\mathcal{P}$ may not coincide with $\mathcal{P}_0$, as there might exist $P',P''\in\mathcal{P}^n$ sharing a facet such that $\phi(P')=\phi(P'')$; when this happens, $f_0$ is affine linear over $P'\cup P''$ and therefore $P'$ and $P''$ are merged together in $\mathcal{P}_0$.
Nonetheless, $\mathcal{P}$ is a refinement of $\mathcal{P}_0$, i.e., for every $P\in\mathcal{P}_0^n$ there exist $P_1,\dots,P_k\in\mathcal{P}^n$ (for some $k\geq 1$) such that $P=P_1\cup\dots\cup P_k$. Moreover, $\phi_0(P)=\phi(P_1)=\dots=\phi(P_k)$. Denoting by $\psi$ the facet-function associated with $\phi$, this implies for a facet $F\in\mathcal{P}^{n-1}$ that $\psi(F)=0$ if and only if $F$ is not a subset of any facet $F'\in\mathcal{P}_0^{n-1}$.

Let $H$ be a hyperplane as in the statement of the proposition. The above discussion shows that

\[ T=\bigcup\left\{F\in\mathcal{P}_0^{n-1}\;\middle|\;F\subseteq H\right\}=\bigcup\left\{F\in\mathcal{P}^{n-1}\;\middle|\;F\subseteq H,\ \psi(F)\neq 0\right\}. \]

Using $S\coloneqq T$, we obtain a contradiction to Lemma 3.10. ∎

4 A Width Bound for Neural Networks with Small Depth

While the proof of Theorem 1.1 by Arora et al. [6] shows that

\[ \operatorname{CPWL}_n=\operatorname{ReLU}_n(\lceil\log_2(n+1)\rceil), \]

it does not provide any bound on the width of the NN required to represent any particular CPWL function. The purpose of this section is to prove that for fixed dimension $n$, the required width for exact, depth-minimal representation of a CPWL function can be polynomially bounded in the number $p$ of affine pieces; specifically by $p^{\mathcal{O}(n^2)}$. This improves previous bounds by He et al. [35] and is closely related to works that bound the number of linear pieces of an NN as a function of the size [55, 59, 61, 54]. It can also be seen as a counterpart, in the context of exact representations, to quantitative universal approximation theorems that bound the number of neurons required to achieve a certain approximation guarantee; see, e.g., [8, 9, 60, 52, 53].

4.1 The Convex Case

We first derive our result for the case of convex CPWL functions and then use this to also prove the general nonconvex case. Our width bound is a consequence of the following theorem about convex CPWL functions, for which we are going to provide a geometric proof later.

Theorem 4.1.

Let $f(x)=\max\{a_i^Tx+b_i\mid i\in[p]\}$ be a convex CPWL function with $p$ pieces defined on $\mathbb{R}^n$. Then $f$ can be written as

\[ f(x)=\sum_{\substack{S\subseteq[p],\\ \lvert S\rvert\leq n+1}}c_S\max\{a_i^Tx+b_i\mid i\in S\} \]

with coefficients $c_S\in\mathbb{Z}$.

For the convex case, this yields a stronger version of Theorem 1.3, which states that any (not necessarily convex) CPWL function can be written as a linear combination of $(n+1)$-term maxima. Theorem 4.1 is stronger in the sense that it guarantees that all pieces of the $(n+1)$-term maxima are pieces of the original function. This makes it possible to bound the total number of these $(n+1)$-term maxima and, therefore, the size of an NN representing $f$, as we will see in the proof of the following theorem.
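To make the shape of such a representation concrete, here is a minimal sketch for $n=1$ and $p=3$. The three affine pieces and the integer coefficients below are a hand-picked, hypothetical instance (not derived by the proof of Theorem 4.1): a maximum of three affine functions on $\mathbb{R}$ is rewritten using only maxima over at most $n+1=2$ of its own pieces.

```python
from fractions import Fraction

# Hypothetical pieces of a convex CPWL function on R (n = 1, p = 3),
# ordered by increasing slope so that each one is active on some interval.
def l1(x):
    return -x

def l2(x):
    return 0 * x

def l3(x):
    return x - 1

def f(x):
    """The original function: a 3-term maximum."""
    return max(l1(x), l2(x), l3(x))

def two_term_representation(x):
    """Integer combination of maxima over subsets S with |S| <= n + 1 = 2:
    c_{1,2} = 1, c_{2,3} = 1, c_{2} = -1, all other coefficients 0."""
    return max(l1(x), l2(x)) + max(l2(x), l3(x)) - l2(x)

# Compare both expressions exactly on a grid covering all three pieces.
grid = [Fraction(k, 8) for k in range(-40, 41)]
print(all(f(x) == two_term_representation(x) for x in grid))
```

Every maximum on the right-hand side only involves pieces of `f` itself, which is the feature that later allows the number of maxima to be counted in terms of $p$.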

Theorem 4.2.

Let $f\colon\mathbb{R}^n\to\mathbb{R}$ be a convex CPWL function with $p$ affine pieces. Then $f$ can be represented by a ReLU NN with depth $\lceil\log_2(n+1)\rceil+1$ and width $\mathcal{O}(p^{n+1})$.

Proof.

Using the representation of Theorem 4.1, we can construct an NN computing $f$ by computing all the $(n+1)$-term max functions in parallel with the construction of Lemma 1.2 (similar to the proof by Arora et al. [6] to show Theorem 1.1). This results in an NN with the claimed depth. Moreover, the width is at most a constant times the number of these $(n+1)$-term max functions. This number can be bounded in terms of the number of possible subsets $S\subseteq[p]$ with $\lvert S\rvert\leq n+1$, which is at most $p^{n+1}$. ∎
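The counting step at the end of this proof can be sanity-checked numerically. The sketch below is only an illustration; the helper names `num_maxima` and `subset_bound` are hypothetical, and the count is restricted to nonempty subsets (an assumption, since a maximum over the empty set does not occur).

```python
from math import comb

def num_maxima(p, n):
    """Number of nonempty subsets S of [p] with |S| <= n + 1."""
    return sum(comb(p, k) for k in range(1, n + 2))

def subset_bound(p, n):
    """The bound p^(n+1) used in the proof."""
    return p ** (n + 1)

# Check the inequality on a small range of parameters.
ok = all(num_maxima(p, n) <= subset_bound(p, n)
         for p in range(1, 30) for n in range(1, 8))
print(num_maxima(5, 2), subset_bound(5, 2), ok)
```

For $p=5$ and $n=2$ there are 25 such subsets, against the bound $5^3=125$, so the estimate $p^{n+1}$ is generous but of the right polynomial order for fixed $n$.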

Before we present the proof of Theorem 4.1, we show how we can generalize its consequences to the nonconvex case.

4.2 The General (Nonconvex) Case

It is a well-known fact that every CPWL function can be expressed as a difference of two convex CPWL functions, see, e.g., [73, Theorem 1]. This allows us to derive the general case from the convex case. What we need, however, is to bound the number of affine pieces of the two convex CPWL functions in terms of the number of pieces of the original function. Therefore, we consider a specific decomposition for which such bounds can easily be achieved.

Proposition 4.3.

Let $f\colon\mathbb{R}^n\to\mathbb{R}$ be a CPWL function with $p$ affine pieces. Then, one can write $f$ as $f=g-h$ where both $g$ and $h$ are convex CPWL functions with at most $p^{2n+1}$ pieces.

Proof.

Suppose the $p$ affine pieces of $f$ are given by $x\mapsto a_i^Tx+b_i$, $i\in[p]$. Define the function $h(x)\coloneqq\sum_{1\leq i<j\leq p}\max\{a_i^Tx+b_i,a_j^Tx+b_j\}$ and let $g\coloneqq f+h$. Then, obviously, $f=g-h$. It remains to show that both $g$ and $h$ are convex CPWL functions with at most $p^{2n+1}$ pieces.

The convexity of $h$ is clear by definition. Consider the $\binom{p}{2}=\frac{p(p-1)}{2}<p^2$ hyperplanes given by $a_i^Tx+b_i=a_j^Tx+b_j$, $1\leq i<j\leq p$. They divide $\mathbb{R}^n$ into at most $\binom{p^2}{n}+\binom{p^2}{n-1}+\dots+\binom{p^2}{0}\leq p^{2n}$ regions (compare [20, Theorem 1.3]) in each of which $h$ is affine. In particular, $h$ has at most $p^{2n}\leq p^{2n+1}$ pieces.
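The region count used for $h$ can be tabulated for small parameters. The following sketch assumes the standard bound $\sum_{k=0}^{n}\binom{m}{k}$ for the number of regions of $m$ hyperplanes in $\mathbb{R}^n$ and plugs in the exact hyperplane count $m=\binom{p}{2}$ (a choice of this illustration; the text works with the coarser estimate $p^2$). The function names are hypothetical.

```python
from math import comb

def max_regions(m, n):
    """Upper bound on the number of regions cut out by m hyperplanes in R^n."""
    return sum(comb(m, k) for k in range(n + 1))

def pieces_bound_h(p, n):
    """Region bound for the comb(p, 2) hyperplanes {a_i x + b_i = a_j x + b_j}."""
    return max_regions(comb(p, 2), n)

# Compare against the bound p^(2n) stated for the number of pieces of h.
ok = all(pieces_bound_h(p, n) <= p ** (2 * n)
         for p in range(1, 30) for n in range(1, 8))
print(pieces_bound_h(4, 2), 4 ** (2 * 2), ok)
```

For $p=4$ and $n=2$ this gives at most 22 regions, well below $4^4=256$.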

Next, we show that $g=f+h$ is convex. Intuitively, this holds because each possible breaking hyperplane of $f$ is made convex by adding $h$. To make this formal, note that by the definition of convexity, it suffices to show that $g$ is convex along each affine line. For this purpose, consider an arbitrary line $x(t)=ta+b$, $t\in\mathbb{R}$, given by $a\in\mathbb{R}^n$ and $b\in\mathbb{R}^n$. Let $\tilde{f}(t)\coloneqq f(x(t))$, $\tilde{g}(t)\coloneqq g(x(t))$, and $\tilde{h}(t)\coloneqq h(x(t))$. We need to show that $\tilde{g}\colon\mathbb{R}\to\mathbb{R}$ is a convex function. Observe that $\tilde{f}$, $\tilde{g}$, and $\tilde{h}$ are clearly one-dimensional CPWL functions with the property $\tilde{g}=\tilde{f}+\tilde{h}$. Hence, it suffices to show that $\tilde{g}$ is locally convex around each of its breakpoints. Let $t\in\mathbb{R}$ be an arbitrary breakpoint of $\tilde{g}$.
If $\tilde{f}$ is already locally convex around $t$, then the same holds for $\tilde{g}$ as well since $\tilde{h}$ inherits convexity from $h$. Now suppose that $t$ is a nonconvex breakpoint of $\tilde{f}$. Then there exist two distinct pieces of $f$, indexed by $i,j\in[p]$ with $i\neq j$, such that $\tilde{f}(t')=\min\{a_i^Tx(t')+b_i,a_j^Tx(t')+b_j\}$ for all $t'$ sufficiently close to $t$.
By construction, $\tilde{h}(t')$ contains the summand $\max\{a_i^Tx(t')+b_i,a_j^Tx(t')+b_j\}$. Thus, adding this summand to $\tilde{f}$ linearizes the nonconvex breakpoint of $\tilde{f}$, while adding all the other summands preserves convexity. In total, $\tilde{g}$ is locally convex around $t$, which finishes the proof that $g$ is a convex function.

Finally, observe that pieces of $g=f+h$ are always intersections of pieces of $f$ and $h$, for which we have only $p\cdot p^{2n}=p^{2n+1}$ possibilities. ∎
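The decomposition $f=g-h$ can be tried out on a one-dimensional example. The sketch below uses a hypothetical nonconvex function, the clamp of $x$ to $[0,1]$ with pieces $0$, $x$ and $1$, and tests convexity only through second differences on a uniform grid, which is a necessary condition checked on a sample rather than a proof.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical example: pieces (slope, intercept) of f(x) = max(0, min(x, 1)).
PIECES = [(0, 0), (1, 0), (0, 1)]

def f(x):
    """Clamp of x to [0, 1]: a nonconvex CPWL function with p = 3 pieces."""
    return max(0, min(x, 1))

def h(x):
    """Sum of pairwise maxima over all pairs of pieces, as in the proof."""
    return sum(max(a1 * x + b1, a2 * x + b2)
               for (a1, b1), (a2, b2) in combinations(PIECES, 2))

def g(x):
    """g = f + h, so that f = g - h."""
    return f(x) + h(x)

def is_convex_on_grid(func, grid):
    """Check that all second differences on a uniform grid are nonnegative."""
    return all(func(grid[k - 1]) - 2 * func(grid[k]) + func(grid[k + 1]) >= 0
               for k in range(1, len(grid) - 1))

grid = [Fraction(k, 4) for k in range(-20, 21)]
print(is_convex_on_grid(f, grid), is_convex_on_grid(h, grid), is_convex_on_grid(g, grid))
```

In this instance $g$ simplifies to $\max\{2,\,2x+2\}$: the concave kink of $f$ at $x=1$ is cancelled by the summand $\max\{x,1\}$ of $h$, matching the linearization step in the proof.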

Having this, we may conclude the following.

See 1.9

Proof.

Consider the decomposition $f=g-h$ from Proposition 4.3. Using Theorem 4.2, we obtain that both $g$ and $h$ can be represented with the required depth $\lceil\log_2(n+1)\rceil+1$ and with width $\mathcal{O}((p^{2n+1})^{n+1})=\mathcal{O}(p^{2n^2+3n+1})$. Thus, the same holds true for $f$. ∎
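For concrete input dimensions, the depth and the width exponent in this bound are easy to tabulate. The helper names below are hypothetical, and the sketch assumes $n\geq 1$; it computes $\lceil\log_2(n+1)\rceil$ with integer arithmetic to avoid floating-point rounding.

```python
def depth(n):
    """ceil(log2(n + 1)) + 1 for n >= 1; n.bit_length() equals ceil(log2(n + 1))."""
    return n.bit_length() + 1

def width_exponent(n):
    """Exponent e such that the width bound reads O(p^e): (2n + 1)(n + 1)."""
    return (2 * n + 1) * (n + 1)

for n in (1, 2, 3, 4, 8):
    # The expanded form 2n^2 + 3n + 1 must agree with the product form.
    assert width_exponent(n) == 2 * n * n + 3 * n + 1
    print(n, depth(n), width_exponent(n))
```

For example, $n=3$ gives depth 3 and a width of order $p^{28}$, which illustrates that the bound is polynomial in $p$ only for fixed $n$.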

4.3 Extended Newton Polyhedra of Convex CPWL Functions

For our proof of Theorem 4.1, we use a correspondence of convex CPWL functions with certain polyhedra, which are known as (extended) Newton polyhedra in tropical geometry [49]. These relations between tropical geometry and neural networks have previously been applied to investigate expressivity of NNs; compare our references in Section 1.5.

In order to formalize this correspondence, let $\operatorname{CCPWL}_n \subseteq \operatorname{CPWL}_n$ be the set of convex CPWL functions of type $\mathbb{R}^n \to \mathbb{R}$. For $f(x) = \max\{a_i^T x + b_i \mid i \in [p]\}$ in $\operatorname{CCPWL}_n$, we define its so-called extended Newton polyhedron to be

\[
\mathcal{N}(f) \coloneqq \operatorname{conv}(\{(a_i^T, b_i)^T \in \mathbb{R}^n \times \mathbb{R} \mid i \in [p]\}) + \operatorname{cone}(\{-e_{n+1}\}) \subseteq \mathbb{R}^{n+1},
\]

where the “+” stands for Minkowski addition. We denote the set of all possible extended Newton polyhedra in $\mathbb{R}^{n+1}$ as $\operatorname{Newt}_n$. That is, $\operatorname{Newt}_n$ is the set of (unbounded) polyhedra in $\mathbb{R}^{n+1}$ that emerge from a polytope by adding the negative of the $(n+1)$-st unit vector $-e_{n+1}$ as an extreme ray. Hence, a set $P \subseteq \mathbb{R}^{n+1}$ is an element of $\operatorname{Newt}_n$ if and only if $P$ can be written as

\[
P = \operatorname{conv}(\{(a_i^T, b_i)^T \in \mathbb{R}^n \times \mathbb{R} \mid i \in [p]\}) + \operatorname{cone}(\{-e_{n+1}\}).
\]

Conversely, for a polyhedron $P \in \operatorname{Newt}_n$ of this form, let $\mathcal{F}(P) \in \operatorname{CCPWL}_n$ be the function defined by $\mathcal{F}(P)(x) = \max\{a_i^T x + b_i \mid i \in [p]\}$.

There is an intuitive way of thinking about the extended Newton polyhedron $P$ of a convex CPWL function $f$: it consists of all hyperplane coefficients $(a^T, b)^T \in \mathbb{R}^n \times \mathbb{R}$ such that $a^T x + b \leq f(x)$ for all $x \in \mathbb{R}^n$. This also explains why we add the extreme ray $-e_{n+1}$: decreasing $b$ obviously maintains the property of $a^T x + b$ being a lower bound on the function $f$. Hence, if a point $(a^T, b)^T$ belongs to the extended Newton polyhedron $P$, then also all points $(a^T, b')^T$ with $b' < b$ should belong to it. Thus, $-e_{n+1}$ should be contained in the recession cone of $P$.
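The lower-bound interpretation can be illustrated numerically. The following is a minimal sketch, not part of the original argument: the three pieces, the helper names `max_affine`, `newton_point`, and `is_lower_bound`, and the sample grid are all hypothetical choices, and the check only inspects finitely many points, so it is an illustration rather than a proof.

```python
from itertools import product

def max_affine(pieces, x):
    """Evaluate f(x) = max_i (a_i^T x + b_i); each piece is a pair (a_i, b_i)."""
    return max(sum(aj * xj for aj, xj in zip(a, x)) + b for a, b in pieces)

# Hypothetical example with n = 2 and p = 3 pieces.
pieces = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.5), ((-1.0, -1.0), 1.0)]

def newton_point(pieces, weights, drop):
    """Convex combination of the (a_i, b_i), moved by drop >= 0 along -e_{n+1}."""
    n = len(pieces[0][0])
    a = tuple(sum(w * p[0][j] for w, p in zip(weights, pieces)) for j in range(n))
    b = sum(w * p[1] for w, p in zip(weights, pieces)) - drop
    return a, b

# Finite sample grid in [-3, 3]^2 (an illustration, not a proof).
grid = [i / 2 for i in range(-6, 7)]
samples = list(product(grid, repeat=2))

def is_lower_bound(a, b, tol=1e-9):
    """Check a^T x + b <= f(x) on all sample points."""
    return all(sum(aj * xj for aj, xj in zip(a, x)) + b <= max_affine(pieces, x) + tol
               for x in samples)

a, b = newton_point(pieces, (0.2, 0.3, 0.5), drop=0.7)
inside = is_lower_bound(a, b)            # a point of the extended Newton polyhedron
outside = is_lower_bound((1.0, 0.0), 1.0)  # first piece shifted upwards by 1
print(inside, outside)
```

Moving a generator upwards in the last coordinate leaves the polyhedron, and the corresponding affine function then exceeds $f$ somewhere on the grid.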

In fact, there is a one-to-one correspondence between elements of $\operatorname{CCPWL}_n$ and $\operatorname{Newt}_n$, which is nicely compatible with some (functional and polyhedral) operations. This correspondence has been studied before in tropical geometry [49, 41], convex geometry² [39], as well as neural network literature [76, 15, 2, 54]. We summarize the key findings about this correspondence relevant to our work in the following proposition:

² $\mathcal{N}(f)$ is the negative of the epigraph of the convex conjugate of $f$.

Proposition 4.4.

Let $n \in \mathbb{N}$ and $f_1, f_2 \in \operatorname{CCPWL}_n$. Then it holds that

  (i) the functions $\mathcal{N}\colon \operatorname{CCPWL}_n \to \operatorname{Newt}_n$ and $\mathcal{F}\colon \operatorname{Newt}_n \to \operatorname{CCPWL}_n$ are well-defined, that is, their output is independent of the representation of the input by pieces or vertices, respectively,

  (ii) $\mathcal{N}$ and $\mathcal{F}$ are bijections and inverse to each other,

  (iii) $\mathcal{N}(\max\{f_1, f_2\}) = \operatorname{conv}(\mathcal{N}(f_1), \mathcal{N}(f_2)) \coloneqq \operatorname{conv}(\mathcal{N}(f_1) \cup \mathcal{N}(f_2))$,

  (iv) $\mathcal{N}(f_1 + f_2) = \mathcal{N}(f_1) + \mathcal{N}(f_2)$, where the $+$ on the right-hand side is Minkowski addition.

An algebraic way of phrasing this proposition is as follows: $\mathcal{N}$ and $\mathcal{F}$ are isomorphisms between the semirings $(\operatorname{CCPWL}_n, \max, +)$ and $(\operatorname{Newt}_n, \operatorname{conv}, +)$.
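Parts (iii) and (iv) of Proposition 4.4 can be sanity-checked on small examples. The sketch below is an illustration under stated assumptions: polyhedra are represented only by finite generator lists (which may contain redundant points), the helper names `evaluate`, `hull_of_union`, and `minkowski_sum` are hypothetical, and the comparison is made on a finite integer grid.

```python
from itertools import product

def evaluate(generators, x):
    """Value at x of F(P) for P = conv(generators) + cone({-e_{n+1}}).

    Each generator is a tuple (a_1, ..., a_n, b)."""
    return max(sum(aj * xj for aj, xj in zip(z[:-1], x)) + z[-1] for z in generators)

def hull_of_union(P, Q):
    """Generators of conv(P ∪ Q): simply the union of both generator lists."""
    return P + Q

def minkowski_sum(P, Q):
    """Generators of P + Q: all pairwise sums of generators."""
    return [tuple(u + v for u, v in zip(z, w)) for z in P for w in Q]

# Hypothetical generator lists for two convex CPWL functions on R^2.
P1 = [(1, 0, 0), (0, 1, 2), (-1, -1, 1)]
P2 = [(2, 1, -1), (0, 0, 3)]
samples = list(product(range(-3, 4), repeat=2))

# (iii): the maximum of two functions corresponds to the convex hull of the union.
max_ok = all(evaluate(hull_of_union(P1, P2), x) == max(evaluate(P1, x), evaluate(P2, x))
             for x in samples)
# (iv): the sum of two functions corresponds to the Minkowski sum.
sum_ok = all(evaluate(minkowski_sum(P1, P2), x) == evaluate(P1, x) + evaluate(P2, x)
             for x in samples)
print(max_ok, sum_ok)
```

Integer data keeps all comparisons exact, so no tolerance is needed in this particular example.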

4.4 Proof of Theorem 4.1

The rough idea to prove Theorem 4.1 is as follows. Suppose we have a $p$-term max function $f$ with $p \geq n+2$. By Proposition 4.4, $f$ corresponds to a polyhedron $P \in \operatorname{Newt}_n$ with at least $n+2$ vertices. Applying a classical result from discrete geometry known as Radon’s theorem allows us to carefully decompose $P$ into a “signed”³ Minkowski sum of polyhedra in $\operatorname{Newt}_n$ whose vertices are subsets of at most $p-1$ out of the $p$ vertices of $P$. Translating this back into the world of CPWL functions by Proposition 4.4 yields that $f$ can be written as a linear combination of $p'$-term maxima with $p' < p$, where each of them involves a subset of the $p$ affine terms of $f$. We can then obtain Theorem 4.1 by iterating until every occurring maximum expression involves at most $n+1$ terms.

³ Some polyhedra may occur with “negative” coefficients in that sum, meaning that they are actually added to $P$ instead of the other polyhedra. The corresponding CPWL functions will then have negative coefficients in the linear combination representing $f$.

We start with a proposition that will be useful for our proof of Theorem 4.1. Although its statement is well-known in the discrete geometry community, we include a proof for the sake of completeness. To show the proposition, we make use of Radon’s theorem (compare [20, Theorem 4.1]), stating that any set of at least $n+2$ points in $\mathbb{R}^n$ can be partitioned into two nonempty subsets such that their convex hulls intersect.

Proposition 4.5.

Given $p > n+1$ vectors $z_i = (a_i^T, b_i)^T \in \mathbb{R}^{n+1}$, $i \in [p]$, there exists a nonempty subset $U \subsetneq [p]$ featuring the following property: there is no $c \in \mathbb{R}^{n+1}$ with $c_{n+1} \geq 0$ and $\gamma \in \mathbb{R}$ such that

\[
\begin{split}
c^T z_i &> \gamma \quad \text{for all $i \in U$, and} \\
c^T z_i &\leq \gamma \quad \text{for all $i \in [p] \setminus U$.}
\end{split}
\tag{8}
\]

Proof.

Radon’s theorem applied to the at least $n+2$ vectors $a_i$, $i \in [p]$, yields a nonempty subset $U \subsetneq [p]$ and coefficients $\lambda_i \in [0,1]$ with $\sum_{i \in U} \lambda_i = \sum_{i \in [p] \setminus U} \lambda_i = 1$ such that $\sum_{i \in U} \lambda_i a_i = \sum_{i \in [p] \setminus U} \lambda_i a_i$. Suppose that $\sum_{i \in U} \lambda_i b_i \leq \sum_{i \in [p] \setminus U} \lambda_i b_i$ without loss of generality (otherwise exchange the roles of $U$ and $[p] \setminus U$).

For any $c$ and $\gamma$ that satisfy (8) and $c_{n+1} \geq 0$ it follows that

\[
\gamma < c^T \sum_{i \in U} \lambda_i z_i \leq c^T \sum_{i \in [p] \setminus U} \lambda_i z_i \leq \gamma,
\]

proving that no such $c$ and $\gamma$ can exist. ∎
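For readers who want the chain of inequalities spelled out, the following expansion may help. It is a sketch that introduces one piece of notation not used in the original text, namely the splitting $c = (\bar{c}^T, c_{n+1})^T$ with $\bar{c} \in \mathbb{R}^n$; everything else comes from the Radon coefficients and the assumption on the $b_i$.

```latex
% Notation introduced here for illustration only: c = (\bar{c}^T, c_{n+1})^T with \bar{c} \in \mathbb{R}^n.
% The strict inequality uses \sum_{i \in U} \lambda_i = 1 and c^T z_i > \gamma for all i \in U.
% The middle inequality uses equality of the a-parts, c_{n+1} \geq 0, and the assumption on the b_i.
\begin{align*}
\gamma = \sum_{i \in U} \lambda_i \gamma
  &< \sum_{i \in U} \lambda_i c^T z_i
   = \bar{c}^T \sum_{i \in U} \lambda_i a_i + c_{n+1} \sum_{i \in U} \lambda_i b_i \\
  &\leq \bar{c}^T \sum_{i \in [p] \setminus U} \lambda_i a_i + c_{n+1} \sum_{i \in [p] \setminus U} \lambda_i b_i
   = \sum_{i \in [p] \setminus U} \lambda_i c^T z_i
   \leq \sum_{i \in [p] \setminus U} \lambda_i \gamma = \gamma.
\end{align*}
```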

The following proposition is a crucial step in order to show that any convex CPWL function with $p > n+1$ pieces can be expressed as an integer linear combination of convex CPWL functions with at most $p-1$ pieces.

Proposition 4.6.

Let $f(x) = \max\{a_i^T x + b_i \mid i \in [p]\}$ be a convex CPWL function defined on $\mathbb{R}^n$ with $p > n+1$. Then there exists a subset $U \subseteq [p]$ such that

\[
\sum_{\substack{W \subseteq U, \\ \lvert W \rvert \text{ even}}} \max\{a_i^T x + b_i \mid i \in [p] \setminus W\} = \sum_{\substack{W \subseteq U, \\ \lvert W \rvert \text{ odd}}} \max\{a_i^T x + b_i \mid i \in [p] \setminus W\}. \tag{9}
\]

Proof.

Consider the $p > n+1$ vectors $z_i \coloneqq (a_i^T, b_i)^T \in \mathbb{R}^{n+1}$, $i \in [p]$. Choose $U$ according to Proposition 4.5. We show that this choice of $U$ guarantees equation (9).

For $W \subseteq U$, let $f_W(x) = \max\{a_i^T x + b_i \mid i \in [p] \setminus W\}$ and consider its extended Newton polyhedron $P_W = \mathcal{N}(f_W) = \operatorname{conv}(\{z_i \mid i \in [p] \setminus W\}) + \operatorname{cone}(\{-e_{n+1}\})$. By Proposition 4.4, equation (9) is equivalent to

\[
P_{\operatorname{even}} \coloneqq \sum_{\substack{W \subseteq U, \\ \lvert W \rvert \text{ even}}} P_W = \sum_{\substack{W \subseteq U, \\ \lvert W \rvert \text{ odd}}} P_W \eqqcolon P_{\operatorname{odd}},
\]

where the sums are Minkowski sums.

We show this equation by showing that for all vectors $c \in \mathbb{R}^{n+1}$ it holds that

\[
\max\{c^T x \mid x \in P_{\operatorname{even}}\} = \max\{c^T x \mid x \in P_{\operatorname{odd}}\}. \tag{10}
\]

Let $c \in \mathbb{R}^{n+1}$ be an arbitrary vector. If $c_{n+1} < 0$, both sides of (10) are infinite. Hence, from now on, assume that $c_{n+1} \geq 0$. Then, both sides of (10) are finite since $-e_{n+1}$ is the only extreme ray of all involved polyhedra.

Due to our choice of $U$ according to Proposition 4.5, there exists an index $u \in U$ such that

\[
c^T z_u \leq \max_{i \in [p] \setminus U} c^T z_i. \tag{11}
\]

We define a bijection $\varphi_u$ between the even and the odd subsets of $U$ as follows:

\[
\varphi_u(W) \coloneqq
\begin{cases}
W \cup \{u\}, & \text{if } u \notin W, \\
W \setminus \{u\}, & \text{if } u \in W.
\end{cases}
\]

That is, $\varphi_u$ changes the parity of $W$ by adding or removing $u$. Considering the corresponding polyhedra $P_W$ and $P_{\varphi_u(W)}$, this means that $\varphi_u$ adds or removes the extreme point $z_u$ to or from $P_W$. Due to (11) this does not change the optimal value of maximizing in $c$-direction over the polyhedra, that is,

\[
\max\{c^T x \mid x \in P_W\} = \max\{c^T x \mid x \in P_{\varphi_u(W)}\}.
\]

Hence, we may conclude

\begin{align*}
\max\{c^T x \mid x \in P_{\operatorname{even}}\}
  &= \sum_{\substack{W \subseteq U, \\ \lvert W \rvert \text{ even}}} \max\{c^T x \mid x \in P_W\} \\
  &= \sum_{\substack{W \subseteq U, \\ \lvert W \rvert \text{ even}}} \max\{c^T x \mid x \in P_{\varphi_u(W)}\} \\
  &= \sum_{\substack{W \subseteq U, \\ \lvert W \rvert \text{ odd}}} \max\{c^T x \mid x \in P_W\} \\
  &= \max\{c^T x \mid x \in P_{\operatorname{odd}}\},
\end{align*}

which proves (10). Thus, the claim follows. ∎
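Identity (9) can be checked on a small instance. The sketch below uses a hypothetical example with $n = 1$ and $p = 3$, namely $f(x) = \max\{0, x+1, 2x\}$, where the Radon partition of the slopes is $a_2 = \tfrac{1}{2}(a_1 + a_3)$ and the condition on the $b_i$ forces $U = \{1, 3\}$. The helper names `f_W` and `side` are illustrative, and the comparison is made on finitely many sample points only.

```python
from itertools import combinations

# Hypothetical example with n = 1 and p = 3: f(x) = max{0, x + 1, 2x}.
pieces = {1: (0, 0), 2: (1, 1), 3: (2, 0)}   # index i -> (a_i, b_i)

def f_W(W, x):
    """Maximum over the affine terms whose index is not in W."""
    return max(a * x + b for i, (a, b) in pieces.items() if i not in W)

def side(U, parity, x):
    """Sum of f_W(x) over all subsets W of U with |W| congruent to parity mod 2."""
    return sum(f_W(set(W), x)
               for k in range(len(U) + 1) if k % 2 == parity
               for W in combinations(sorted(U), k))

xs = [k / 4 for k in range(-20, 21)]          # dyadic sample points, exact in floating point

# U = {1, 3}: (b_1 + b_3) / 2 = 0 <= 1 = b_2, as required in the proof of Proposition 4.5.
good = all(side({1, 3}, 0, x) == side({1, 3}, 1, x) for x in xs)
# U = {2} violates that requirement, and the identity fails (for instance at x = 0).
bad = all(side({2}, 0, x) == side({2}, 1, x) for x in xs)
print(good, bad)
```

For $U = \{1, 3\}$ the identity reads $\max\{0, x+1, 2x\} + (x+1) = \max\{x+1, 2x\} + \max\{0, x+1\}$, which expresses $f$ through maxima of at most two terms.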

With the help of this result, we can now prove Theorem 4.1.

Proof of Theorem 4.1.

Let $f(x) = \max\{a_i^T x + b_i \mid i \in [p]\}$ be a convex CPWL function defined on $\mathbb{R}^n$. Having a closer look at the statement of Proposition 4.6, observe that only one term on the left-hand side of (9) contains all $p$ affine combinations $a_i^T x + b_i$. Putting all other maximum terms on the other side, we may write $f$ as an integer linear combination of maxima of at most $p-1$ summands. Repeating this procedure until we have eliminated all maximum terms with more than $n+1$ summands yields the desired representation. ∎
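The iteration in this proof can be sketched in code for the special case $n = 1$, where a Radon partition of three slopes is explicit (the middle slope is a convex combination of the other two). This is an illustrative sketch only, not the authors' implementation: the function names `radon_subset`, `reduce_maxima`, and `evaluate`, the example data, and the restriction to integer or `Fraction` coefficients are assumptions made here; for general $n$ one would have to compute a Radon partition by solving a linear system.

```python
from fractions import Fraction
from itertools import combinations

def radon_subset(S, pieces):
    """Pick U (a proper nonempty subset of S) as in Proposition 4.5, for n = 1.

    Only three indices of S are inspected; all other indices stay outside U."""
    i, j, k = sorted(sorted(S)[:3], key=lambda t: pieces[t][0])  # a_i <= a_j <= a_k
    (ai, bi), (aj, bj), (ak, bk) = pieces[i], pieces[j], pieces[k]
    if ai == ak:                       # all three slopes equal: partition {i} vs. {j}
        return {i} if bi <= bj else {j}
    lam = Fraction(ak - aj, ak - ai)   # a_j = lam * a_i + (1 - lam) * a_k
    chord = lam * bi + (1 - lam) * bk
    return {i, k} if chord <= bj else {j}

def reduce_maxima(pieces):
    """Write max_i (a_i x + b_i) as an integer combination of maxima of <= 2 terms."""
    combo = {frozenset(pieces): 1}     # index set S -> integer coefficient of max over S
    while True:
        big = max(combo, key=len)
        if len(big) <= 2:
            return {S: c for S, c in combo.items() if c != 0}
        coeff = combo.pop(big)
        U = sorted(radon_subset(big, pieces))
        # Rearranged identity (9): f_S = sum over nonempty W of (-1)^(|W|+1) f_{S \ W}.
        for r in range(1, len(U) + 1):
            for W in combinations(U, r):
                T = big - frozenset(W)
                combo[T] = combo.get(T, 0) + (-1) ** (r + 1) * coeff

def evaluate(combo, pieces, x):
    """Evaluate the signed combination of maxima at the point x."""
    return sum(c * max(pieces[i][0] * x + pieces[i][1] for i in S)
               for S, c in combo.items())

# Hypothetical example: f(x) = max{0, x + 1, 2x, 3x - 4}.
pieces = {1: (0, 0), 2: (1, 1), 3: (2, 0), 4: (3, -4)}
combo = reduce_maxima(pieces)
print(sorted((sorted(S), c) for S, c in combo.items()))
```

Always reducing a largest remaining index set guarantees termination, because each step replaces one set by strictly smaller ones. The number of terms produced is not optimized in this sketch.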

4.5 Potential Approaches to Show Lower Bounds on the Width

In light of the upper width bounds shown in this section, a natural question to ask is whether meaningful lower bounds can also be achieved. This would mean constructing a family of CPWL functions with $p$ pieces defined on $\mathbb{R}^n$ (with different values of $p$ and $n$), for which we can prove that a large width is required to represent these functions with NNs of depth $\lceil \log_2(n+1) \rceil + 1$.

A trivial and not very satisfying answer follows, e.g., from [61] or [67]: for fixed input dimension $n$, they show that a function computed by an NN with $k$ hidden layers and width $w$ has at most $\mathcal{O}(w^{kn})$ pieces. For our setting, this means that an NN with logarithmic depth needs a width of at least $\mathcal{O}(p^{1/(n \log n)})$ to represent a function with $p$ pieces. This is, of course, very far away from our upper bounds.
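The step from the piece count to the width bound can be made explicit as follows. This is a rough sketch under two assumptions that are not spelled out in this section: the constant $C_n$ hidden in the $\mathcal{O}$-notation is a hypothetical name and depends only on $n$, and depth $\lceil \log_2(n+1) \rceil + 1$ is taken to mean $k = \lceil \log_2(n+1) \rceil$ hidden layers.

```latex
% Assumptions: C_n > 0 is a hypothetical constant depending only on n,
% and the network has k = \lceil \log_2(n+1) \rceil hidden layers.
\[
  p \leq C_n \, w^{kn}
  \quad\Longrightarrow\quad
  w \geq \left( \frac{p}{C_n} \right)^{1/(kn)}
  \qquad \text{with } k = \lceil \log_2(n+1) \rceil .
\]
% Example: n = 3 gives k = 2, so this lower bound grows like p^{1/6},
% whereas the upper bound O(p^{2n^2+3n+1}) grows like p^{28}.
```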

Similar upper bounds on the number of pieces have been proven by many other authors and are often used to show depth-width trade-offs [55, 54, 59, 70, 6]. However, there is a good reason why all these results only give rise to very trivial lower bounds for our setting: the focus is always on functions with considerably many pieces, which then, consequently, need many neurons to be represented (with small depth). However, since the lower bounds we strive for depend on the number of pieces, we would need to construct a family of functions with comparably few pieces that still need a lot of neurons to be represented. In general, it seems to be a tough task to argue why such functions should exist.

A different approach could leverage methods from complexity theory, in particular from circuit complexity. Neural networks are basically arithmetic circuits with very special operations allowed. In fact, they can be seen as a tropical variant of arithmetic circuits. Showing circuit lower bounds is a notoriously difficult task in complexity theory, but maybe some conditional result (based on common conjectures similar to $\mathrm{P} \neq \mathrm{NP}$) could be established.

We think that the question whether our bounds are tight, or whether at least some non-trivial lower bounds on the width for NNs with logarithmic depth can be shown, is an exciting question for further research.

5 Understanding Expressivity via Newton Polytopes

In Section 2, we presented a mixed-integer programming approach towards proving that deep NNs can represent strictly more functions than shallow ones. However, even if we could prove that it is indeed enough to consider $H$-conforming NNs, this approach would not generalize to deeper networks due to computational limitations. Therefore, different ideas are needed to prove Conjecture 1.4 in its full generality. In this section, we point out that Newton polytopes of convex CPWL functions (similar to what we used in the previous section) could also be a way of proving Conjecture 1.4. Using a homogenized version of Proposition 4.4, we provide an equivalent formulation of Conjecture 1.4 that is completely phrased in the language of discrete geometry.

Recall that, by Proposition 2.3, we may restrict ourselves to NNs without biases. In particular, all CPWL functions represented by such NNs, or by parts of them, are positively homogeneous. For the associated extended Newton polyhedra (compare Proposition 4.4), this has the following consequence: all vertices $(a,b)\in\mathbb{R}^n\times\mathbb{R}$ lie in the hyperplane $b=0$, that is, their $(n+1)$-st coordinate is $0$. Therefore, the extended Newton polyhedron of a positively homogeneous, convex CPWL function $f(x)=\max\{a_i^T x\mid i\in[p]\}$ is completely characterized by the so-called Newton polytope, that is, the polytope $\operatorname{conv}(\{a_i\mid i\in[p]\})\subseteq\mathbb{R}^n$.

To make this formal, let $\overline{\operatorname{CCPWL}}_n$ be the set of all positively homogeneous, convex CPWL functions of type $\mathbb{R}^n\to\mathbb{R}$ and let $\overline{\operatorname{Newt}}_n$ be the set of all convex polytopes in $\mathbb{R}^n$. Moreover, for $f(x)=\max\{a_i^T x\mid i\in[p]\}$ in $\overline{\operatorname{CCPWL}}_n$, let

$$\overline{\mathcal{N}}(f)\coloneqq\operatorname{conv}(\{a_i\mid i\in[p]\})\in\overline{\operatorname{Newt}}_n$$

be the associated Newton polytope of $f$, and for $P=\operatorname{conv}(\{a_i\mid i\in[p]\})\in\overline{\operatorname{Newt}}_n$, let

$$\overline{\mathcal{F}}(P)(x)=\max\{a_i^T x\mid i\in[p]\}$$

be the so-called associated support function [38] of $P$ in $\overline{\operatorname{CCPWL}}_n$. With this notation, we obtain the following variant of Proposition 4.4.

Proposition 5.1.

Let $n\in\mathbb{N}$ and $f_1,f_2\in\overline{\operatorname{CCPWL}}_n$. Then it holds that

  (i) the functions $\overline{\mathcal{N}}\colon\overline{\operatorname{CCPWL}}_n\to\overline{\operatorname{Newt}}_n$ and $\overline{\mathcal{F}}\colon\overline{\operatorname{Newt}}_n\to\overline{\operatorname{CCPWL}}_n$ are well-defined, that is, their output is independent of the representation of the input by pieces or vertices, respectively,

  (ii) $\overline{\mathcal{N}}$ and $\overline{\mathcal{F}}$ are bijections and inverse to each other,

  (iii) $\overline{\mathcal{N}}(\max\{f_1,f_2\})=\operatorname{conv}(\overline{\mathcal{N}}(f_1),\overline{\mathcal{N}}(f_2))\coloneqq\operatorname{conv}(\overline{\mathcal{N}}(f_1)\cup\overline{\mathcal{N}}(f_2))$,

  (iv) $\overline{\mathcal{N}}(f_1+f_2)=\overline{\mathcal{N}}(f_1)+\overline{\mathcal{N}}(f_2)$, where the $+$ on the right-hand side is Minkowski addition.

In other words, $\overline{\mathcal{N}}$ and $\overline{\mathcal{F}}$ are isomorphisms between the semirings $(\overline{\operatorname{CCPWL}}_n,\max,+)$ and $(\overline{\operatorname{Newt}}_n,\operatorname{conv},+)$.
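Properties (iii) and (iv) can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: polytopes are stored as finite lists of generating points, and the helper names (`support`, `hull_of_union`, `minkowski_sum`, `check_proposition`) as well as the example polytopes are hypothetical choices made for illustration.

```python
from itertools import product

def support(points, x):
    """Support function of conv(points): the maximum of a^T x over all listed points a."""
    return max(sum(a_j * x_j for a_j, x_j in zip(a, x)) for a in points)

def hull_of_union(P, Q):
    """Generating points of conv(P ∪ Q): simply all points of both lists."""
    return P + Q

def minkowski_sum(P, Q):
    """Generating points of the Minkowski sum: all pairwise sums."""
    return [tuple(p_j + q_j for p_j, q_j in zip(p, q)) for p in P for q in Q]

# Illustrative data (an assumption): f1 = max{0, x1} and f2 = max{x2, x1 + x2}.
P1 = [(0, 0), (1, 0)]
P2 = [(0, 1), (1, 1)]

def check_proposition(P, Q, grid=range(-3, 4)):
    """Compare both sides of (iii) and (iv) on an integer grid in the plane."""
    for x in product(grid, repeat=2):
        f1, f2 = support(P, x), support(Q, x)
        if support(hull_of_union(P, Q), x) != max(f1, f2):
            return False
        if support(minkowski_sum(P, Q), x) != f1 + f2:
            return False
    return True

print(check_proposition(P1, P2))
```

Working with generating point sets instead of vertex sets avoids any convex hull computation, since the support function only depends on the convex hull of the listed points.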

Next, we study which polytopes can appear as Newton polytopes of convex CPWL functions computed by NNs with a certain depth; compare Zhang et al. [76].

Before we apply the first ReLU activation, any function computed by an NN is linear. Thus, the corresponding Newton polytope is a single point. Starting from that, let us investigate a neuron in the first hidden layer. Here, the ReLU activation function computes a maximum of a linear function and $0$. Therefore, the Newton polytope of the resulting function is the convex hull of two points, that is, a line segment. After the first hidden layer, arbitrarily many functions of this type can be added up. For the corresponding Newton polytopes, this means that we take the Minkowski sum of line segments, resulting in a so-called zonotope.
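As a small illustration of this first-layer construction, the sketch below builds the generating points of the zonotope for a sum of ReLU neurons with output weights equal to $1$. The function names and the example weights are assumptions made for illustration only.

```python
def minkowski_sum(P, Q):
    """All pairwise sums of points, deduplicated and sorted."""
    return sorted({tuple(p_j + q_j for p_j, q_j in zip(p, q)) for p in P for q in Q})

def segment(w):
    """Newton polytope of x -> max{0, w^T x}: the segment conv{0, w}."""
    return [tuple(0 for _ in w), tuple(w)]

def first_layer_polytope(weights):
    """Generating points of the zonotope for x -> sum_i max{0, w_i^T x}."""
    points = [tuple(0 for _ in weights[0])]  # start from the single point 0
    for w in weights:
        points = minkowski_sum(points, segment(w))
    return points

# max{0, x1} + max{0, x2}: two segments whose Minkowski sum is the unit square.
square = first_layer_polytope([(1, 0), (0, 1)])
print(square)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The returned list may contain points that are not vertices once more segments are added; only its convex hull is the Newton polytope.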

Now, this construction can be repeated layerwise, making use of Proposition 5.1: in each hidden layer, we can compute the maximum of two functions computed by the previous layers, which translates to obtaining the new Newton polytope as a convex hull of the union of the two original Newton polytopes. In addition, the linear combinations between layers translate to scaling and taking Minkowski sums of Newton polytopes.

This intuition motivates the following definition. Let $\overline{\operatorname{Newt}}_n^{(0)}$ be the set of all polytopes in $\mathbb{R}^n$ that consist only of a single point. Then, for each $k\geq 1$, we recursively define

$$\overline{\operatorname{Newt}}_n^{(k)}\coloneqq\left\{\sum_{i=1}^{p}\operatorname{conv}(P_i,Q_i)\;\middle|\;P_i,Q_i\in\overline{\operatorname{Newt}}_n^{(k-1)},\ p\in\mathbb{N}\right\},$$

where the sum is a Minkowski sum of polytopes. A first, but not precisely accurate interpretation is as follows: the set $\overline{\operatorname{Newt}}_n^{(k)}$ contains the Newton polytopes of positively homogeneous, convex CPWL functions representable with a $k$-hidden-layer NN. See Figure 5 for an illustration of the case $k=2$.

[Figure 5 labels, in order: inputs $x_1$, $x_2$; output $y$; $\overline{\operatorname{Newt}}_n^{(0)}$; points; line segments; $\overline{\operatorname{Newt}}_n^{(1)}$; zonotopes; $\operatorname{conv}(\text{two zonotopes})$; $\overline{\operatorname{Newt}}_n^{(2)}$]
Figure 5: Set of polytopes that can arise as Newton polytopes of convex CPWL functions computed by (parts of) a 2-hidden-layer NN.

Unfortunately, this interpretation is not accurate for the following reason: our NNs are allowed to have negative weights, which cannot be fully captured by Minkowski sums as introduced above. Therefore, it might be possible that a $k$-hidden-layer NN can compute a convex function with Newton polytope not in $\overline{\operatorname{Newt}}_n^{(k)}$. Luckily, one can remedy this shortcoming, and even extend the interpretation to the non-convex case, by representing the computed function as a difference of two convex functions.

Theorem 5.2.

A positively homogeneous (not necessarily convex) CPWL function can be computed by a $k$-hidden-layer NN if and only if it can be written as the difference of two positively homogeneous, convex CPWL functions with Newton polytopes in $\overline{\operatorname{Newt}}_n^{(k)}$.

Proof.

We use induction on $k$. For $k=0$, the statement is clear since it holds precisely for linear functions. For the induction step, suppose that, for some $k\geq 1$, the equivalence is valid up to $k-1$ hidden layers. We prove that it is also valid for $k$ hidden layers.

We need to show two directions. For the first direction, assume that $f$ is an arbitrary, positively homogeneous CPWL function that can be written as $f=g-h$ with $\overline{\mathcal{N}}(g),\overline{\mathcal{N}}(h)\in\overline{\operatorname{Newt}}_n^{(k)}$. We need to show that a $k$-hidden-layer NN can compute $f$. We show that this is even true for $g$ and $h$, and hence, also for $f$. By definition of $\overline{\operatorname{Newt}}_n^{(k)}$, there exist a finite number $p\in\mathbb{N}$ and polytopes $P_i,Q_i\in\overline{\operatorname{Newt}}_n^{(k-1)}$, $i\in[p]$, such that $\overline{\mathcal{N}}(g)=\sum_{i=1}^{p}\operatorname{conv}(P_i,Q_i)$. By Proposition 5.1, we have $g=\sum_{i=1}^{p}\max\{\overline{\mathcal{F}}(P_i),\overline{\mathcal{F}}(Q_i)\}$. By induction, $\overline{\mathcal{F}}(P_i)$ and $\overline{\mathcal{F}}(Q_i)$ can be computed by NNs with $k-1$ hidden layers. Since the maximum terms can be computed with a single additional hidden layer, in total $k$ hidden layers are sufficient to compute $g$. An analogous argument applies to $h$. Thus, $f$ is computable with $k$ hidden layers, completing the first direction.

For the other direction, suppose that $f$ is an arbitrary, positively homogeneous CPWL function that can be computed by a $k$-hidden-layer NN. Let us separately consider the $n_k$ neurons in the $k$-th hidden layer of the NN. Let $a_i$, $i\in[n_k]$, be the weight of the connection from the $i$-th neuron in that layer to the output. Without loss of generality, we have $a_i\in\{\pm 1\}$, because otherwise we can normalize it and multiply the weights of the incoming connections to the $i$-th neuron by $\lvert a_i\rvert$ instead. Moreover, let us assume that, by potential reordering, there is some $m\leq n_k$ such that $a_i=1$ for $i\leq m$ and $a_i=-1$ for $i>m$. With these assumptions, we can write

$$f=\sum_{i=1}^{m}\max\{0,f_i\}-\sum_{i=m+1}^{n_k}\max\{0,f_i\},\tag{12}$$

where each $f_i$ is computable by a $(k-1)$-hidden-layer NN, namely the sub-NN computing the input to the $i$-th neuron in the $k$-th hidden layer.

By induction, we obtain $f_i=g_i-h_i$ for some positively homogeneous, convex functions $g_i,h_i$ with $\overline{\mathcal{N}}(g_i),\overline{\mathcal{N}}(h_i)\in\overline{\operatorname{Newt}}_n^{(k-1)}$. We then have

$$\max\{0,f_i\}=\max\{g_i,h_i\}-h_i.\tag{13}$$

We define

$$g\coloneqq\sum_{i=1}^{m}\max\{g_i,h_i\}+\sum_{i=m+1}^{n_k}h_i$$

and

$$h\coloneqq\sum_{i=1}^{m}h_i+\sum_{i=m+1}^{n_k}\max\{g_i,h_i\}.$$

Note that $g$ and $h$ are convex by construction as sums of convex functions and that (12) and (13) imply $f=g-h$. Moreover, by Proposition 5.1,

$$\overline{\mathcal{N}}(g)=\sum_{i=1}^{m}\operatorname{conv}(\overline{\mathcal{N}}(g_i),\overline{\mathcal{N}}(h_i))+\sum_{i=m+1}^{n_k}\operatorname{conv}(\overline{\mathcal{N}}(h_i),\overline{\mathcal{N}}(h_i))\in\overline{\operatorname{Newt}}_n^{(k)}$$

and

$$\overline{\mathcal{N}}(h)=\sum_{i=1}^{m}\operatorname{conv}(\overline{\mathcal{N}}(h_i),\overline{\mathcal{N}}(h_i))+\sum_{i=m+1}^{n_k}\operatorname{conv}(\overline{\mathcal{N}}(g_i),\overline{\mathcal{N}}(h_i))\in\overline{\operatorname{Newt}}_n^{(k)}.$$

Hence, $f$ can be represented as desired, completing also the other direction. ∎
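The decomposition used in the second direction of the proof can be checked on a toy instance. The sketch below is only an illustration under stated assumptions: the functions `g1`, `h1`, `g2`, `h2` are hypothetical convex pieces computable with one hidden layer, and the network has $m=1$ positive and one negative output weight.

```python
from itertools import product

def relu(t):
    return max(0, t)

# Hypothetical convex pieces with f_i = g_i - h_i (chosen for illustration).
def g1(x): return relu(x[0])
def h1(x): return relu(x[1])
def g2(x): return relu(x[0] + x[1])
def h2(x): return relu(x[0])

def f(x):
    """Network output in the form of (12) with m = 1 and n_k = 2."""
    return relu(g1(x) - h1(x)) - relu(g2(x) - h2(x))

def g(x):
    """Convex part: the max term for the positive neuron plus h of the negative one."""
    return max(g1(x), h1(x)) + h2(x)

def h(x):
    """Convex part: h of the positive neuron plus the max term for the negative one."""
    return h1(x) + max(g2(x), h2(x))

def decomposition_holds(grid=range(-4, 5)):
    """Check f = g - h on an integer grid in the plane."""
    return all(f(x) == g(x) - h(x) for x in product(grid, repeat=2))

print(decomposition_holds())
```

Both `g` and `h` are sums of maxima of convex functions, so they are convex, while `f` itself need not be.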

The power of Theorem 5.2 lies in the fact that it provides a purely geometric characterization of the class $\operatorname{ReLU}(k)$. The classes of polytopes $\overline{\operatorname{Newt}}_n^{(k)}$ are solely defined by the two simple geometric operations Minkowski sum and convex hull of the union. Therefore, understanding the class $\operatorname{ReLU}(k)$ is equivalent to understanding what polytopes one can generate by iterative application of these geometric operations.

In particular, we can give yet another equivalent reformulation of our main conjecture. To this end, let the simplex $\Delta_n\coloneqq\operatorname{conv}\{0,e_1,\dots,e_n\}\subseteq\mathbb{R}^n$ denote the Newton polytope of the function $f_n=\max\{0,x_1,\dots,x_n\}$ for each $n\in\mathbb{N}$.

Conjecture 5.3.

For every $k\in\mathbb{N}$, $n=2^k$, there does not exist a pair of polytopes $P,Q\in\overline{\operatorname{Newt}}_n^{(k)}$ with $\Delta_n+Q=P$ (Minkowski sum).

Theorem 5.4.

Conjecture 5.3 is equivalent to Conjecture 1.4 and Conjecture 1.5.

Proof.

By Proposition 1.6, it suffices to show equivalence between Conjecture 5.3 and Conjecture 1.5. By Theorem 5.2, $f_n$ can be represented with $k$ hidden layers if and only if there are functions $g$ and $h$ with Newton polytopes in $\overline{\operatorname{Newt}}_n^{(k)}$ satisfying $f_n+h=g$. By Proposition 5.1, this happens if and only if there are polytopes $P,Q\in\overline{\operatorname{Newt}}_n^{(k)}$ with $\Delta_n+Q=P$. ∎

It is particularly interesting to look at special cases with small $k$. For $k = 1$, the set $\overline{\operatorname{Newt}}_n^{(1)}$ is the set of all zonotopes. Hence, the (known) statement that $\max\{0, x_1, x_2\}$ cannot be computed with one hidden layer [56] is equivalent to the fact that the Minkowski sum of a zonotope and a triangle can never be a zonotope.
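For $n = 2$, this statement can be illustrated on concrete instances. The sketch below relies on the standard fact that zonotopes in the plane are exactly the centrally symmetric convex polygons; it builds a zonotope from integer generators, adds the triangle $\Delta_2$, and tests central symmetry of the result. The generators and all function names are illustrative assumptions, and the computation checks individual examples rather than proving the general claim.

```python
def cross(o, a, b):
    """Cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order, collinear points dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def minkowski_sum(P, Q):
    """Vertices of the Minkowski sum of two convex polygons given by point lists."""
    return convex_hull([(p[0] + q[0], p[1] + q[1]) for p in P for q in Q])


def zonotope(generators):
    """Minkowski sum of the segments conv{0, g} over all generators g."""
    Z = [(0, 0)]
    for g in generators:
        Z = minkowski_sum(Z, [(0, 0), g])
    return Z


def is_centrally_symmetric(vertices):
    """Check whether reflecting each vertex through the vertex centroid gives a vertex.

    Coordinates are scaled by the number of vertices to stay in exact integers.
    """
    m = len(vertices)
    sx = sum(v[0] for v in vertices)
    sy = sum(v[1] for v in vertices)
    scaled = {(m * v[0], m * v[1]) for v in vertices}
    return all((2 * sx - m * v[0], 2 * sy - m * v[1]) in scaled for v in vertices)


TRIANGLE = [(0, 0), (1, 0), (0, 1)]  # the simplex Delta_2

Z = zonotope([(1, 0), (0, 1), (1, 1), (2, -1)])  # arbitrary example generators
P = minkowski_sum(Z, TRIANGLE)

print(is_centrally_symmetric(Z))  # True
print(is_centrally_symmetric(P))  # False
```

The reason behind the second output is that the triangle lengthens the edges of the sum with outward normals $(0,-1)$, $(-1,0)$, and $(1,1)$, but not the edges with the opposite normals, so opposite edges of $Z + \Delta_2$ cannot all have equal length.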

The first open case is the case $k = 2$. An unconditional proof that two hidden layers do not suffice to compute the maximum of five numbers is highly desired. In the regime of Newton polytopes, this means understanding the class $\overline{\operatorname{Newt}}_n^{(2)}$. It consists of finite Minkowski sums of polytopes that arise as the convex hull of the union of two zonotopes. Hence, the major open question here is to classify this set of polytopes.

Finally, let us remark that there exists a generalization of the concept of polytopes, known as virtual polytopes [58], that makes it possible to assign a Newton polytope also to non-convex CPWL functions. This makes use of the fact that every (non-convex) CPWL function is a difference of two convex ones. Consequently, a virtual polytope is a formal Minkowski difference of two ordinary polytopes. Using this concept, Theorem 5.2 and Conjecture 5.3 can be phrased in a simpler way, replacing the pair of polytopes with a single virtual polytope.

6 Future Research

The most obvious and, at the same time, most exciting open research question is to prove or disprove Conjecture 1.4, or equivalently Conjecture 1.5 or Conjecture 5.3. The first step could be to prove that it is indeed enough to consider $H$-conforming NNs. This is intuitive because every breakpoint introduced at any place outside the hyperplanes $H_{ij}$ needs to be canceled out later. Therefore, it is natural to assume that these breakpoints do not have to be introduced in the first place. However, this intuition does not seem to be enough for a formal proof because it could occur that additional breakpoints in intermediate steps, which are canceled out later, also influence the behavior of the function at other places where we allow breakpoints in the end.

Another step towards resolving our conjecture may be to find an alternative proof of Theorem 1.7 that does not use $H$-conforming NNs. This might also be beneficial for generalizing our techniques to more hidden layers, since a direct generalization of the MIP approach, while theoretically possible, is infeasible due to computational limitations. For example, it might be particularly promising to use a tropical approach as described in Section 5 and apply methods from polytope theory to prove Conjecture 5.3.

In light of our results from Section 3, it would be desirable to provide a complete characterization of the functions contained in $\operatorname{ReLU}(k)$. Another potential research goal is improving our upper bounds on the width from Section 4 and/or proving matching lower bounds as discussed in Section 4.5.

Some more interesting research directions are the following:

  • establishing or strengthening our results for special classes of NNs like recurrent neural networks (RNNs) or convolutional neural networks (CNNs),

  • using exact representation results to show more drastic depth-width trade-offs compared to existing results in the literature,

  • understanding how the class $\operatorname{ReLU}(k)$ changes when a polynomial upper bound is imposed on the width of the NN; see related work by Vardi et al. [72],

  • understanding which CPWL functions one can (exactly) represent with polynomial size at all, without any restriction on the depth; see related work in the context of combinatorial optimization [36, 37].

References

  • [1] M. Abrahamsen, L. Kleist, and T. Miltzow. Training neural networks is ER-complete. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
  • [2] M. Alfarra, A. Bibi, H. Hammoud, M. Gaafar, and B. Ghanem. On the decision boundaries of neural networks: A tropical geometry perspective. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [3] A. M. Alvarez, Q. Louveaux, and L. Wehenkel. A machine learning-based approximation of strong branching. INFORMS Journal on Computing, 29(1):185–195, 2017.
  • [4] R. Anderson, J. Huchette, W. Ma, C. Tjandraatmadja, and J. P. Vielma. Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming, pages 1–37, 2020.
  • [5] M. Anthony and P. L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.
  • [6] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.
  • [7] R. Bagnara, P. M. Hill, and E. Zaffanella. The Parma Polyhedra Library: Toward a complete set of numerical abstractions for the analysis and verification of hardware and software systems. Science of Computer Programming, 72(1–2):3–21, 2008.
  • [8] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
  • [9] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Machine learning, 14(1):115–133, 1994.
  • [10] Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research, 2020.
  • [11] D. Bertschinger, C. Hertrich, P. Jungeblut, T. Miltzow, and S. Weber. Training fully connected neural networks is ER-complete. arXiv:2204.01368, 2022.
  • [12] D. Bienstock, G. Muñoz, and S. Pokutta. Principled deep neural network training through linear programming. arXiv:1810.03218, 2018.
  • [13] P. Bonami, A. Lodi, and G. Zarpellon. Learning a classification of mixed-integer quadratic programming problems. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 595–604. Springer, 2018.
  • [14] D. Boob, S. S. Dey, and G. Lan. Complexity of training relu neural network. Discrete Optimization, 44, 2022.
  • [15] V. Charisopoulos and P. Maragos. A tropical approach to neural networks with piecewise linear activations. arXiv preprint arXiv:1805.08749, 2018.
  • [16] K.-L. Chen, H. Garudadri, and B. D. Rao. Improved bounds on neural complexity for representing piecewise linear functions. In Advances in Neural Information Processing Systems, 2022.
  • [17] S. Chen, A. R. Klivans, and R. Meka. Learning Deep ReLU Networks Is Fixed-Parameter Tractable. In N. K. Vishnoi, editor, 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696–707, 2022.
  • [18] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • [19] S. S. Dey, G. Wang, and Y. Xie. Approximation algorithms for training one-node relu neural networks. IEEE Transactions on Signal Processing, 68:6696–6706, 2020.
  • [20] H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer Science & Business Media, 1987.
  • [21] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
  • [22] M. Fischetti and J. Jo. Deep neural networks as 0-1 mixed integer linear programs: A feasibility study. arXiv preprint arXiv:1712.06174, 2017.
  • [23] V. Froese, C. Hertrich, and R. Niedermeier. The computational complexity of ReLU network training parameterized by data dimensionality. Journal of Artificial Intelligence Research, 74:1775–1790, 2022.
  • [24] M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi. Exact combinatorial optimization with graph convolutional neural networks. Advances in neural information processing systems, 32, 2019.
  • [25] S. Goel, V. Kanade, A. Klivans, and J. Thaler. Reliably learning the relu in polynomial time. In Conference on Learning Theory, pages 1004–1042. PMLR, 2017.
  • [26] S. Goel, A. Klivans, and R. Meka. Learning one convolutional layer with overlapping patches. In International Conference on Machine Learning, pages 1783–1791. PMLR, 2018.
  • [27] S. Goel and A. R. Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470–1499. PMLR, 2019.
  • [28] S. Goel, A. R. Klivans, P. Manurangsi, and D. Reichman. Tight hardness results for training depth-2 ReLU networks. In 12th Innovations in Theoretical Computer Science Conference (ITCS ’21), volume 185 of LIPIcs, pages 22:1–22:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • [29] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks. Constructive Approximation, pages 1–109, 2021.
  • [30] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2021.
  • [31] C. A. Haase, C. Hertrich, and G. Loho. Lower bounds on the depth of integral ReLU neural networks via lattice polytopes. In The Eleventh International Conference on Learning Representations, 2023.
  • [32] B. Hanin. Universal function approximation by deep neural nets with bounded width and ReLU activations. Mathematics, 7(10):992, 2019.
  • [33] B. Hanin and M. Sellke. Approximating continuous functions by ReLU nets of minimal width. arXiv:1710.11278, 2017.
  • [34] H. He, H. Daume III, and J. M. Eisner. Learning to search in branch and bound algorithms. Advances in neural information processing systems, 27:3293–3301, 2014.
  • [35] J. He, L. Li, J. Xu, and C. Zheng. Relu deep neural networks and linear finite elements. Journal of Computational Mathematics, 38(3):502–527, 2020.
  • [36] C. Hertrich and L. Sering. ReLU neural networks of polynomial size for exact maximum flow computation. In International Conference on Integer Programming and Combinatorial Optimization, 2023.
  • [37] C. Hertrich and M. Skutella. Provably good solutions to the knapsack problem via neural networks of bounded size. In AAAI Conference on Artificial Intelligence, 2021.
  • [38] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms I, volume 305 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag, Berlin, 1993.
  • [39] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II, volume 306 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag, Berlin, 1993.
  • [40] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
  • [41] M. Joswig. Essentials of tropical combinatorics. Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2022. To appear.
  • [42] S. Khalife and A. Basu. Neural networks with linear threshold activations: structure and algorithms. In International Conference on Integer Programming and Combinatorial Optimization, pages 347–360. Springer, 2022.
  • [43] E. Khalil, P. Le Bodic, L. Song, G. Nemhauser, and B. Dilkina. Learning to branch in mixed integer programming. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • [44] E. B. Khalil, B. Dilkina, G. L. Nemhauser, S. Ahmed, and Y. Shao. Learning to run heuristics in tree search. In IJCAI, pages 659–666, 2017.
  • [45] M. Kruber, M. E. Lübbecke, and A. Parmentier. Learning when to use a decomposition. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, pages 202–210. Springer, 2017.
  • [46] S. Liang and R. Srikant. Why deep neural networks for function approximation? In International Conference on Learning Representations, 2017.
  • [47] A. Lodi and G. Zarpellon. On learning and branching: a survey. TOP, 25(2):207–236, 2017.
  • [48] Z. Lu. A note on the representation power of GHHs. arXiv:2101.11286, 2021.
  • [49] D. Maclagan and B. Sturmfels. Introduction to tropical geometry, volume 161 of Graduate Studies in Mathematics. American Mathematical Soc., 2015.
  • [50] P. Maragos, V. Charisopoulos, and E. Theodosis. Tropical geometry and machine learning. Proceedings of the IEEE, 109(5):728–755, 2021.
  • [51] H. Mhaskar. Approximation of real functions using neural networks. In Proc. Intl. Conf. Comp. Math., New Delhi, India, World Scientific Press, pages 267–278. World Scientific, 1993.
  • [52] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996.
  • [53] H. N. Mhaskar and C. A. Micchelli. Degree of approximation by neural and translation networks with a single hidden layer. Advances in applied mathematics, 16(2):151–183, 1995.
  • [54] G. Montúfar, Y. Ren, and L. Zhang. Sharp bounds for the number of regions of maxout networks and vertices of minkowski sums. SIAM Journal on Applied Algebra and Geometry, 6(4):618–649, 2022.
  • [55] G. F. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2924–2932. 2014.
  • [56] A. Mukherjee and A. Basu. Lower bounds over boolean inputs for deep neural networks with ReLU gates. arXiv:1711.03073, 2017.
  • [57] Q. Nguyen, M. C. Mukkamala, and M. Hein. Neural networks should be wide enough to learn disconnected decision regions. In International Conference on Machine Learning, pages 3737–3746, 2018.
  • [58] G. Y. Panina and I. Streĭnu. Virtual polytopes. Uspekhi Mat. Nauk, 70(6(426)):139–202, 2015.
  • [59] R. Pascanu, G. Montúfar, and Y. Bengio. On the number of inference regions of deep feed forward networks with piece-wise linear activations. In International Conference on Learning Representations, 2014.
  • [60] A. Pinkus. Approximation theory of the mlp model. Acta Numerica 1999: Volume 8, 8:143–195, 1999.
  • [61] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S. Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning, pages 2847–2854, 2017.
  • [62] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
  • [63] I. Safran and O. Shamir. Depth-width tradeoffs in approximating natural functions with neural networks. In International Conference on Machine Learning, pages 2979–2987, 2017.
  • [64] A. Schrijver. Theory of Linear and Integer Programming. John Wiley and Sons, New York, 1986.
  • [65] T. Serra, A. Kumar, and S. Ramalingam. Lossless compression of deep neural networks. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 417–430. Springer, 2020.
  • [66] T. Serra and S. Ramalingam. Empirical bounds on linear regions of deep rectifier networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5628–5635, 2020.
  • [67] T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pages 4565–4573, 2018.
  • [68] R. P. Stanley. An introduction to hyperplane arrangements. In Lecture notes, IAS/Park City Mathematics Institute, 2004.
  • [69] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv:1509.08101, 2015.
  • [70] M. Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517–1539, 2016.
  • [71] The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.0), 2020. https://www.sagemath.org.
  • [72] G. Vardi, D. Reichman, T. Pitassi, and O. Shamir. Size and depth separation in approximating benign functions with neural networks. In Conference on Learning Theory, pages 4195–4223. PMLR, 2021.
  • [73] S. Wang. General constructive representations for continuous piecewise-linear functions. IEEE Transactions on Circuits and Systems I: Regular Papers, 51(9):1889–1896, 2004.
  • [74] S. Wang and X. Sun. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425–4431, 2005.
  • [75] D. Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
  • [76] L. Zhang, G. Naitzat, and L.-H. Lim. Tropical geometry of deep neural networks. In International Conference on Machine Learning, pages 5819–5827, 2018.